Docket No.: 0039-7646-2RD



COMMISSIONER FOR PATENTS 
ALEXANDRIA, VIRGINIA 22313 



OBLON, SPIVAK, McCLELLAND, MAIER & NEUSTADT, P.C.

ATTORNEYS AT LAW



RE: Application Serial No.: 09/532,535 
Applicants: Tatsunori KANAI, et al. 
Filing Date: March 22, 2000 

For: SCHEME FOR SYSTEMATICALLY REGISTERING 

META-DATA WITH RESPECT TO VARIOUS 

TYPES OF DATA 
Group Art Unit: 2151 

Examiner: F. JEAN 

SIR: 

Attached hereto for filing are the following papers: 

Petition Under 37 C.F.R. § 1.181(A)(3) To Invoke The Supervisory Authority Of The
Commissioner, Copy of Filing Receipt Date-Stamped 03/22/00, Copy of Information Disclosure 
Statement Filed 03/22/00, Copy of PTO-1449 Filed 03/22/00, Copy of Statement of Relevancy Filed 
03/22/00, Copy of Filing Receipt Date-Stamped 04/30/02, Copy of Information Disclosure Statement 
Filed 04/30/02, Copy of PTO-1449 Filed 04/30/02, Copy of Cited References (20) 

Our check in the amount of $0.00 is attached covering any required fees. In the event any
variance exists between the amount enclosed and the Patent Office charges for filing the above-noted
documents, including any fees required under 37 C.F.R. 1.136 for any necessary Extension of Time to
make the filing of the attached documents timely, please charge or credit the difference to our Deposit 
Account No. 15-0030. Further, if these papers are not considered timely filed, then a petition is hereby 
made under 37 C.F.R. 1.136 for the necessary extension of time. A duplicate copy of this sheet is 
enclosed. 



Customer Number 

22850 

(703)413-3000 (phone) 
(703)413-2220 (fax) 



Respectfully submitted, 




Eckhard H. Kuesters 
Registration No. 28,870 



1940 DUKE STREET, ALEXANDRIA, VIRGINIA 22314 U.S.A.

Telephone: 703-413-3000 Facsimile: 703-413-2220 www.oblon.com



BEST AVAILABLE COPY 




DOCKET NO: 0039-7646-2RD

IN THE UNITED STATES PATENT & TRADEMARK OFFICE
IN RE APPLICATION OF :

TATSUNORI KANAI, ET AL. : EXAMINER: JEAN, F. 

SERIAL NO: 09/532,535 : 

FILED: MARCH 22, 2000 : GROUP ART UNIT: 2151 

FOR: SCHEME FOR SYSTEMATICALLY : 
REGISTERING META-DATA WITH 
RESPECT TO VARIOUS TYPES OF 
DATA 

PETITION UNDER 37 C.F.R. § 1.181(A)(3) 
TO INVOKE THE SUPERVISORY AUTHORITY OF THE COMMISSIONER 



COMMISSIONER FOR PATENTS 
ALEXANDRIA, VIRGINIA 22313 

SIR: 



Applicant herein petitions the Commissioner to invoke his supervisory authority to 
require the examiner to consider the prior art cited in the Information Disclosure Statements 
filed respectively on March 22, 2000, and April 30, 2002. 

An Information Disclosure Statement in conformity with the requirements of 37
C.F.R. § 1.97 and § 1.98 was filed with the application on March 22, 2000 and, separately, on
April 30, 2002. A copy of each Information Disclosure Statement, the List of References Cited
by Applicants (i.e., Form PTO-1449), the cited prior art references, and a date-stamped filing
receipt are attached. The above-referenced application has now been allowed; however, the
Information Disclosure Statements were never acknowledged or made of record by the
examiner.



Application No. 09/532,535 
Petition Under 37 C.F.R. § 1.181 

Thus, this Petition is being filed in order to require the Examiner to consider the 
references listed on the Information Disclosure Statement. 

Although Applicants do not believe that any fee is required for the present petition,
any required fee should be charged to the undersigned attorney's Deposit Account No. 15-0030.



Respectfully submitted, 




Customer Number 



22850 



Eckhard H. Kuesters 
Attorney of Record 
Registration No. 28,870 



Tel: (703)413-3000 
Fax: (703)413-2220
(OSMMN 06/04) 
Dept.: PP
By: MJS/sk

□ CPA

■ Priority Doc (1)

■ Dep. Acct. Order Form

OSMM&N File No. 0039-7646-2RD
Serial No.: New Application

In the matter of the Application of: Tatsunori KANAI, et al.

SCHEME FOR SYSTEMATICALLY REGISTERING META-DATA WITH
RESPECT TO VARIOUS TYPES OF DATA

The following has been received in the U.S. Patent Office on the date stamped hereon:

■ 38 pp. Specification & 20 Claims/Drawings 11 Sheets

■ Combined Declaration, Petition & Power of Attorney 5 pages 
□ List of Inventor Names and Addresses

■ Utility Patent Application 

■ Notice of Priority 

■ Check for $846.00 

■ Fee Transmittal Form 

□ Assignment/PTO 1595 pages:
□ Letter to Official Draftsman

□ Letter Requesting Approval of Drawing Changes

□ Drawings sheets □ Formal 

□ Letter 

□ Amendment 

■ Information Disclosure Statement 

■ Cited References (7) 
□ Search Report

■ Statement of Relevancy
□ IDS/Related/List of Related Cases
□ Restriction Response
□ Rule 132 Declaration

□ Petition for Extension of Time

□ Notice of Appeal

□ Brief

□ Issue Fee Transmittal

■ White Advance Serial Number Card

□ Election Response

Due Date: 03/23/00




Docket No. 0039-7646-2RD 

IN THE UNITED STATES PATENT AND TRADEMARK OFFICE

IN RE APPLICATION OF: Tatsunori KANAI, et al.

SERIAL NO: New Application GAU:

FILED: Herewith EXAMINER:

FOR: SCHEME FOR SYSTEMATICALLY REGISTERING META-DATA WITH RESPECT TO VARIOUS TYPES OF DATA

INFORMATION DISCLOSURE STATEMENT UNDER 37 CFR 1.97

ASSISTANT COMMISSIONER FOR PATENTS 
WASHINGTON, D.C. 20231

SIR: 

Applicant(s) wish to disclose the following information. 
REFERENCES 

■ The applicant(s) wish to make of record the references listed on the attached form PTO-1449. Copies of the listed 
references are attached, where required, as are either statements of relevancy or any readily available English translations of 
pertinent portions of any non-English language references. 

□ A check is attached in the amount required under 37 CFR § 1.17(p).
RELATED CASES 

□ Attached is a list of applicant's pending application(s) or issued patent(s) which may be related to the present application. 
A copy of the patent(s) is attached along with PTO 1449. 

□ A check is attached in the amount required under 37 CFR § 1.17(p).
CERTIFICATION 

□ Each item of information contained in this information disclosure statement was cited in a communication from a foreign 
patent office in a counterpart foreign application not more than three months prior to the filing of this statement. 

□ No item of information contained in this information disclosure statement was cited in a communication from a foreign 
patent office in a counterpart foreign application or, to the knowledge of the undersigned, having made reasonable inquiry, 
was known to any individual designated in 37 CFR § 1.56(c) more than three months prior to the filing of this statement.

PETITION 

□ Applicant(s) hereby request consideration of the attached information. A check is attached in the amount of the Petition fee 
required under 37 CFR § 1.17(i)(1).

DEPOSIT ACCOUNT 

■ Please charge any additional fees for the papers being filed herewith and for which no check is enclosed herewith, or credit 
any overpayment to deposit account number 15-0030. A duplicate copy of this sheet is enclosed.

Respectfully submitted, 

OBLON, SPIVAK, McCLELLAND,
MAIER & NEUSTADT, P.C.

Marvin J. Spivak
Registration No. 24,913

Fourth Floor
1755 Jefferson Davis Highway
Arlington, Virginia 22202

Tel. (703) 413-3000

Fax. (703) 413-2220

(OSMMN 10/98) 



SHEET 1 OF 1 



Form PTO 1449 
(Modified) 


U.S. DEPARTMENT OF COMMERCE 
PATENT AND TRADEMARK OFFICE 


ATTY DOCKET NO. 

0039-7646-2RD 


SERIAL NO. 

New Application 


LIST OF REFERENCES CITED BY APPLICANT


APPLICANT

Tatsunori KANAI, et al.


[Date stamp: JAN 26; year illegible]


FILING DATE 

Herewith 


GROUP 



OTHER REFERENCES (Including Author, Title, Date, Pertinent Pages, etc.)

AA  [author name illegible], "PRACTICAL FILE SYSTEM DESIGN WITH THE BE FILE SYSTEM", 1999, pgs. 65-97

AB  Y.Y. GOLAND, et al., "HTTP EXTENSIONS FOR DISTRIBUTED AUTHORING - WEBDAV", February 1999, pgs. 1-71

AC  "RECORDING - HELICAL-SCAN DIGITAL VIDEO CASSETTE RECORDING SYSTEM USING 6,35 MM MAGNETIC TAPE
    FOR CONSUMER USE (525-60, 625-50, 1125-60, 1250-50 SYSTEMS) - PART 4: PACK HEADER TABLE AND
    CONTENTS", International Electrotechnical Commission, 1998, pgs. 1-7, 116-147

AD  Saveen REDDY, et al., "DAV SEARCHING AND LOCATING", DAV Searching and Locating Protocol, June 3, 1999,
    pgs. 1-26

AE  "DIGITAL STILL CAMERA IMAGE FILE FORMAT STANDARD", Japan Electronic Industry Development Association
    Standard, June 1998, pgs. 17-69

AF  T. BERNERS-LEE, et al., "Hypertext Transfer Protocol - HTTP/1.0", Network Working Group, May 1996, pgs. 1-60

AG  R. FIELDING, et al., "Hypertext Transfer Protocol - HTTP/1.1", Network Working Group, January 1997, pgs. 1-162

AH to AQ  (no entries)



Examiner 



Date Considered 



Examiner: Initial if reference is considered, whether or not citation is in conformance with MPEP 609; Draw line through citation if not in 
conformance and not considered. Include copy of this form with next communication to applicant. 




0039-7646-2RD page 1 of 1



IN THE UNITED STATES PATENT AND TRADEMARK OFFICE

IN RE APPLICATION OF: Tatsunori KANAI, et al.
SERIAL NO.: New Application 
FILED: Herewith 

FOR: SCHEME FOR SYSTEMATICALLY REGISTERING META-DATA WITH 
RESPECT TO VARIOUS TYPES OF DATA 

STATEMENT OF RELEVANCY 
Reference AA on Form PTO-1449: 

This reference relates to a file system where each file is managed with meta data
(attributes and values). However, it does not show an automatic mechanism for attaching
data-type-specific meta data.

Reference AB on Form PTO-1449:

This reference discloses a standard interface to put/get meta data on a resource via the
HTTP protocol.







Dept.: EE 
By: MJS/atp/rm 



OSMM&N File No. 0039-7646-2RD
Serial No. 09/532,535

In the matter of the Application of: Tatsunori KANAI, et al.

For: SCHEME FOR SYSTEMATICALLY REGISTERING META-DATA

WITH RESPECT TO VARIOUS TYPES OF DATA



□CPA 

□ Priority Doc 

■ Dep. Acct. Order Form 



The following has been received in the U.S. Patent Office on the date stamped hereon: 

□ pp. Specification &Claims/Drawings Sheets 

□ Combined Declaration, Petition & Power of pages 
Attorney 

□ List of Inventor Names and Addresses 

□ Utility Patent Application 

□ Notice of Priority 

□ Check for ■ 

□ Fee Transmittal Form 

□ Assignment/PTO-1595 pages:
□ Letter to Official Draftsman

□ Letter Requesting Approval of Drawing Changes 

□ Drawings sheets □ Formal 

□ Letter 

□ Amendment 

■ Information Disclosure Statement ■ PTO-1449 

■ Cited References ( 12 ) 

■ EUROPEAN Search Report 
□ Statement of Relevancy

□ English Abstracts, Concise Explanation, English Translation, 
Partial English Translation ( ) 

□ IDS/Related/List of Related Cases 

□ Restriction Response 

□ Rule 132 Declaration

□ Petition for Extension of Time 

□ Notice of Appeal 




□Cited Pending Applications ( ) 
□ Election Response 



Due date: 06/18/02 




Docket No. 0039-7646-2RD/atp

IN THE UNITED STATES PATENT AND TRADEMARK OFFICE
IN RE APPLICATION OF: Tatsunori KANAI, et al.

SERIAL NO: 09/532,535 GAU: 2755

FILED: March 22, 2000 EXAMINER:

FOR: SCHEME FOR SYSTEMATICALLY REGISTERING META-DATA WITH RESPECT TO VARIOUS TYPES OF DATA

INFORMATION DISCLOSURE/RELATED CASE STATEMENT UNDER 37 CFR 1.97

ASSISTANT COMMISSIONER FOR PATENTS
WASHINGTON, D.C. 20231

SIR: 

Applicant(s) wish to disclose the following information. 
REFERENCES 

■ The applicant(s) wish to make of record the references cited in the attached European Search Report listed on the attached
form PTO-1449. Copies of the listed references are attached, where required, as are either statements of relevancy or any 
readily available English translations of pertinent portions of any non-English language references. 

□ A check is attached in the amount required under 37 CFR § 1.17(p).

RELATED CASES 

□ Attached is a list of applicant's pending application(s) or issued patent(s) which may be related to the present application. 
A copy of the patent(s), together with a copy of the claims and drawings of the pending application(s) is attached along 
with PTO 1449. 

□ A check is attached in the amount required under 37 CFR § 1.17(p).
CERTIFICATION 

■ Each item of information contained in this information disclosure statement was first cited in any communication from a 
foreign patent office in a counterpart foreign application not more than three months prior to the filing of this statement. 

□ No item of information contained in this information disclosure statement was cited in a communication from a foreign 
patent office in a counterpart foreign application or, to the knowledge of the undersigned, having made reasonable inquiry, 
was known to any individual designated in 37 CFR § 1.56(c) more than three months prior to the filing of this statement.

DEPOSIT ACCOUNT 

■ Please charge any additional fees for the papers being filed herewith and for which no check is enclosed herewith, or credit 
any overpayment to deposit account number 15-0030. A duplicate copy of this sheet is enclosed.

Respectfully submitted, 

OBLON, SPIVAK, McCLELLAND, 
MAIER & NEUSTADT, P.C. 




Marvin J. Spivak

Registration No. 24,913




22850 

Tel. (703)413-3000 
Fax. (703)413-2220 
(OSMMN 10/98) 




SHEET 1 OF 1 



Form PTO 1449 U.S. DEPARTMENT OF COMMERCE 
(Modified) PATENT AND TRADEMARK OFFICE 

LIST OF REFERENCES CITED BY APPLICANT

[Date stamp: JAN 26; year illegible]


ATTY DOCKET NO.

0039-7646-2RD

SERIAL NO.

09/532,535


APPLICANT 

Tatsunori KANAI, et al. 


FILING DATE 

March 22, 2000 


GROUP 

2755 


U.S. PATENT DOCUMENTS

EXAMINER
INITIAL        DOCUMENT NUMBER   DATE       NAME                     CLASS   SUB CLASS   FILING DATE IF APPROPRIATE

          AA   5,715,397         02/03/98   S. S. OGAWA, et al.
          AB   5,629,846         05/13/97   A. W. CRAPO
          AC   5,627,997         05/06/97   M. E. PEARSON, et al.
          AD   5,557,780         09/17/96   A. T. EDWARDS, et al.
          AE   5,835,712         11/10/98   R. B. DuFRESNE
          AF   5,721,912         02/24/98   F. M. STEPCZYK, et al.
          AG to AJ  (no entries)

FOREIGN PATENT DOCUMENTS

               DOCUMENT NUMBER   DATE       COUNTRY      TRANSLATION (YES / NO)

          AK   53031/98          08/27/98   AUSTRALIA
          AL   WO 98/03928       01/29/98   WIPO
          AM to AQ  (no entries)


OTHER REFERENCES (including Author, Title, Date, Pertinent Pages, etc.)

AR  R. A. NADO, et al., SIGMOD Record, vol. 26, no. 4, pages 32-38, XP-002193377,
    "EXTRACTING ENTITY PROFILES FROM SEMISTRUCTURED INFORMATION SPACES", December 1997

AS  B. ADELBERG, ACM Proceedings of SIGMOD. International Conference on Management of Data, vol. 27, no. 2,
    pages 1-25, XP-002949327, "NoDoSE - A TOOL FOR SEMI-AUTOMATICALLY EXTRACTING STRUCTURED AND
    SEMISTRUCTURED DATA FROM TEXT DOCUMENTS", 1998

AT  N. ASHISH, et al., Proceedings of the Second IFCIS International Conference on Kiawah Island, pages 160-169,
    XP-010240791, "SEMI-AUTOMATIC WRAPPER GENERATION FOR INTERNET INFORMATION SOURCES",
    1997

AU  D. FLORESCU, et al., SIGMOD Record, vol. 27, no. 3, pages 59-74, XP-002193378,
    "DATABASE TECHNIQUES FOR THE WORLD-WIDE WEB: A SURVEY", September 1998

AV, AW  (no entries)


Examiner 






Date Considered 


Examiner: Initial if reference is considered, whether or not citation is in conformance with MPEP 609; Draw line through citation if not in conformance and not
considered. Include copy of this form with next communication to applicant. 
RECORDING - HELICAL-SCAN DIGITAL VIDEO CASSETTE RECORDING

FOREWORD

1) [...] in accordance with conditions determined by agreement between the two
organizations.

2) The formal decisions or agreements of the IEC on technical matters express, as nearly as possible, an
international consensus of opinion on the relevant subjects since each technical committee has representation
from all interested National Committees.

3) The documents produced have the form of recommendations for international use and are published in the form
of standards, technical reports or guides and they are accepted by the National Committees in that sense.

4) In order to promote international unification, IEC National Committees undertake to apply IEC International
Standards [...] indicated in the latter.

5) The IEC provides no marking procedure to indicate its approval and cannot be rendered responsible for any
equipment declared to be in conformity with one of its standards.

6) Attention is drawn to the possibility that some of the elements of this International Standard may be the subject
of patent rights. The IEC shall not be held responsible for identifying any or all such patent rights.

International Standard IEC 61834-4 has been prepared by subcommittee 100B: Audio, video and
multimedia information storage systems, of IEC technical committee 100: Audio, video and
multimedia systems and equipment.

The text of this standard is based on the following documents:

FDIS             Report on voting
100B/164/FDIS    100B/174/RVD



Full information on the voting for the approval of this standard can be found in the report on 
voting indicated in the above table. 
IEC 61834 consists of the following parts:

- Part 1: General specifications;

- Part 2: SD format for 525-60 and 625-50 systems;

- Part 3: HD format for 1125-60 and 1250-50 systems;

- Part 4: Pack header table and contents;

- Part 5: Character information system.

[...] and describes the pack header table and the contents of
[...] the whole recording system of helical-scan digital video cassette.

Part 1 describes the common specifications for the helical-scan digital video cassette recording 
system using 6,35 mm magnetic tape. 

Part 2 describes the specifications for 525-60 and 625-50 systems which are not included in
Part 1.

Part 3 describes the specifications for 1125-60 and 1250-50 systems which are not included in
Part 1 and Part 2.

Part 5 describes the character information system which is applicable to the whole recording
system of helical-scan digital video cassette.

For manufacturing SD digital video cassette recording system, Part 1, Part 2, Part 4 and Part 5
are referred to.

For manufacturing HD digital video cassette recording system, Part 1, Part 2, Part 3, Part 4
and Part 5 are referred to.

This part of IEC 61834 is to be referred to particularly when the pack header table and the 
contents are to be checked. 

A bilingual version of this standard may be issued at a later date. 
9 VAUX

VAUX 0

9.1 SOURCE

        MSB                                                              LSB
PC 0    0 1 1 0 0 0 0 0
PC 1    TENS of TV CHANNEL (4 bits) | UNITS of TV CHANNEL (4 bits)
PC 2    B/W (1 bit) | EN (1 bit) | CLF (2 bits) | HUNDREDS of TV CHANNEL (4 bits)
PC 3    SOURCE CODE (2 bits) | 50/60 (1 bit) | STYPE (5 bits)
PC 4    TUNER CATEGORY (8 bits)




This pack shall be recorded at least in the VAUX main area. 
TV CHANNEL: The number of the television channel 

001 to 999 = Television channel 

EEEh = Pre-recorded tape or LINE (MUSE) 

FFFh = No information 

TV CHANNEL should indicate the channel number which is assigned to the broadcasting 
station, and it may indicate the channel number which is set by the user on the receiver. 

B/W: Black and white flag 

0 = Black and white 

1 = Colour 

B/W flag should be set to 1 for consumer digital VCR. 

EN: Colour frames enable flag 

0 = CLF is valid 

1 = CLF is invalid 

CLF: Colour frames identification code (refer to ITU-R Report 624-4) 

For 525-60 system 

00b = Colour frame A 

01b = Colour frame B 
Others = reserved 

For 625-50 system 

00b = 1st, 2nd field 

01b = 3rd, 4th field 

10b = 5th, 6th field 

11b = 7th, 8th field

50/60: 

0 = 60 field system 

1 = 50 field system 
B/W   EN   50/60   CLF   System    Colour frame
1     0    0       00    525-60    Valid     Colour frame A
                   01                        Colour frame B
1     0    1       00    625-50    Valid     1st, 2nd fields
                   01                        3rd, 4th fields
                   10                        5th, 6th fields
                   11                        7th, 8th fields
X     1    X       11              Invalid

X: don't care
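
The interpretation of B/W, EN, CLF and 50/60 described above can be sketched in Python as follows. This is only an illustrative sketch: the function name `decode_colour_frame` is hypothetical, and the bit positions are an assumption read from the MSB-to-LSB order of the pack layout in 9.1 (B/W, EN, CLF in the upper four bits of PC 2, and 50/60 in bit 5 of PC 3).

```python
def decode_colour_frame(pc2: int, pc3: int) -> dict:
    """Interpret B/W, EN and CLF (PC 2) together with 50/60 (PC 3)."""
    bw = (pc2 >> 7) & 0x1      # assumed bit 7: 0 = black and white, 1 = colour
    en = (pc2 >> 6) & 0x1      # assumed bit 6: 0 = CLF is valid, 1 = CLF is invalid
    clf = (pc2 >> 4) & 0x3     # assumed bits 5-4: colour frames identification code
    fifty = (pc3 >> 5) & 0x1   # assumed bit 5 of PC 3: 0 = 60 field, 1 = 50 field

    if en == 1:
        colour_frame = None    # CLF carries no meaning when EN = 1
    elif fifty == 0:
        # 525-60 system: only 00b and 01b are defined, others are reserved
        colour_frame = {0b00: "Colour frame A", 0b01: "Colour frame B"}.get(clf, "Reserved")
    else:
        # 625-50 system: all four CLF values are defined
        colour_frame = ("1st, 2nd field", "3rd, 4th field",
                        "5th, 6th field", "7th, 8th field")[clf]

    return {"colour": bw == 1, "clf_valid": en == 0,
            "field_rate": 50 if fifty else 60, "colour_frame": colour_frame}
```

For example, under these assumptions `decode_colour_frame(0b10010000, 0)` reports a colour signal of a 60 field system carrying colour frame B.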



SOURCE CODE:

SOURCE CODE defines the input source of the video signal in combination with TV
CHANNEL and TUNER CATEGORY as follows.

SOURCE   TV CHANNEL             TUNER              Input source
CODE     100's   10's   1's     CATEGORY
00       Fh      Fh     Fh      FFh                Camera
01       Eh      Eh     Eh      FFh                Line (MUSE)
01       Fh      Fh     Fh      FFh                Line
10       0h      0h     1h      FFh                Cable Ch1
         0h      0h     2h                         Ch2
         :       :      :                          :
         9h      9h     9h                         Ch999
11       0h      0h     1h      Prescribed value   Tuner Ch1
         0h      0h     2h                         Ch2
         :       :      :                          :
         9h      9h     9h                         Ch999
11       Eh      Eh     Eh      FFh                Pre-recorded tape
11       Fh      Fh     Fh      FFh                No information
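
As an illustration of how SOURCE CODE and the three TV CHANNEL digits combine, the following Python sketch classifies the input source. The function names are hypothetical, the nibble positions are assumed from the pack layout in 9.1 (tens and units in PC 1, hundreds in the lower four bits of PC 2), and the TUNER CATEGORY column is not checked here.

```python
def tv_channel_digits(pc1: int, pc2: int) -> tuple:
    """Return (hundreds, tens, units) of TV CHANNEL as 4-bit values."""
    hundreds = pc2 & 0x0F          # assumed lower nibble of PC 2
    tens = (pc1 >> 4) & 0x0F       # assumed upper nibble of PC 1
    units = pc1 & 0x0F             # assumed lower nibble of PC 1
    return hundreds, tens, units


def input_source(source_code: int, digits: tuple) -> str:
    """Map SOURCE CODE plus TV CHANNEL digits to an input source name."""
    if digits == (0xF, 0xF, 0xF):
        return {0b00: "Camera", 0b01: "Line",
                0b11: "No information"}.get(source_code, "Undefined")
    if digits == (0xE, 0xE, 0xE):
        return {0b01: "Line (MUSE)",
                0b11: "Pre-recorded tape"}.get(source_code, "Undefined")
    if all(d <= 9 for d in digits):
        channel = digits[0] * 100 + digits[1] * 10 + digits[2]
        if 1 <= channel <= 999:
            if source_code == 0b10:
                return f"Cable Ch{channel}"
            if source_code == 0b11:
                return f"Tuner Ch{channel}"
    return "Undefined"             # combination not listed in the table
```

Combinations that do not appear in the table are reported as "Undefined" in this sketch; the source text does not say how a decoder should treat them.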



STYPE:

STYPE defines a signal type of video signal in combination with the 50/60 flag as follows.

STYPE              50/60 = 0         50/60 = 1
00000              525-60 system     625-50 system
00001              Reserved
00010              1125-60 system    1250-50 system
00011 to 11111     Reserved
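
The STYPE lookup can be expressed as a small table, as in the Python sketch below. The function name `video_system` is hypothetical; the values are taken from the table above.

```python
def video_system(stype: int, fifty_sixty: int) -> str:
    """Return the video system for a 5-bit STYPE and the 50/60 flag."""
    table = {
        0b00000: ("525-60 system", "625-50 system"),
        0b00010: ("1125-60 system", "1250-50 system"),
    }
    entry = table.get(stype & 0x1F)
    # Index 0 is the 60 field system, index 1 the 50 field system.
    return entry[fifty_sixty & 0x1] if entry else "Reserved"
```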
TUNER CATEGORY:

TUNER CATEGORY consists of area number and satellite number as follows.
TUNER CATEGORY = FFh is indicative of no information.

b7 b6 b5         Area number
b4 b3 b2 b1 b0   Satellite number

Area number specification

Area number   Region      Area
000, 001      Region 1    Europe, Africa
010 to 101    Region 2    North America, South America
110, 111      Region 3    Asia, Oceania

Details of area number are to be decided.
For region 1

Area number   Satellite number   Satellite name
000           00000              UHF/VHF
              00001              Reserved
              00010              ASTRA A+B
              00011              ASTRA C+D
              00100              TELECOM (France)
              00101              TELECOM-2
              00110 to 11111     Reserved
001           00000              UHF/VHF
              00001 to 11111     Reserved

For region 2

Area number   Satellite number   Satellite name
010           00000              UHF/VHF
              00001 to 11111     Reserved
011           00000              UHF/VHF
              00001 to 11111     Reserved
100           00000              UHF/VHF
              00001 to 11111     Reserved
101           00000              UHF/VHF
              00001 to 11111     Reserved

For region 3

Area number   Satellite number   Satellite name
110           00000              UHF/VHF
              00001              BS
              00010              SCC-A
              00011              SCC-B
              00100              JCSAT-1
              00101              JCSAT-2
              00110 to 11111     Reserved
111           00000              UHF/VHF
              00001 to 11110     Reserved
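
A TUNER CATEGORY byte can be split into its area number and satellite number and looked up against the region tables, as in the following Python sketch. The function and table names are hypothetical, and the bit split (area number in b7 to b5, satellite number in b4 to b0) is an assumption read from the bit diagram above.

```python
REGION_BY_AREA = {0b000: 1, 0b001: 1,
                  0b010: 2, 0b011: 2, 0b100: 2, 0b101: 2,
                  0b110: 3, 0b111: 3}

# Satellite names per area number; any number not listed is "Reserved".
SATELLITES = {
    0b000: {0: "UHF/VHF", 2: "ASTRA A+B", 3: "ASTRA C+D",
            4: "TELECOM (France)", 5: "TELECOM-2"},
    0b110: {0: "UHF/VHF", 1: "BS", 2: "SCC-A", 3: "SCC-B",
            4: "JCSAT-1", 5: "JCSAT-2"},
}


def decode_tuner_category(value: int):
    """Return area, region and satellite for a TUNER CATEGORY byte."""
    if value == 0xFF:
        return None                      # FFh = no information
    area = (value >> 5) & 0x7            # assumed bits b7-b5
    satellite = value & 0x1F             # assumed bits b4-b0
    names = SATELLITES.get(area, {0: "UHF/VHF"})
    return {"area_number": area,
            "region": REGION_BY_AREA[area],
            "satellite_number": satellite,
            "satellite_name": names.get(satellite, "Reserved")}
```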
VAUX 1

9.2 SOURCE CONTROL

        MSB                                                              LSB
PC 0    0 1 1 0 0 0 0 1
PC 1    CGMS (2 bits) | ISR (2 bits) | CMP (2 bits) | SS (2 bits)
PC 2    REC ST (1 bit) | 1 | REC MODE (2 bits) | 1 | DISP (3 bits)
PC 3    FF | FS | FC | IL | ST | SC | BCSYS (2 bits)
PC 4    1 | GENRE CATEGORY (7 bits)




This pack shall be recorded at least in the VAUX main area. 

CGMS: Copy generation management system

00b = Copying permitted without restriction
01b = Not used
10b = One generation of copying permitted
11b = No copying permitted

If CGMS information encoded in the incoming signal is "0 0", a digital VCR may make a copy
and shall encode "0 0" on "CGMS".

If CGMS information encoded in the incoming signal is "1 0", a digital VCR may make a copy
and shall encode "1 1" on "CGMS".

If CGMS information encoded in the incoming signal is "1 1", a digital VCR shall not make a copy.

Each manufacturer has the discretion to follow the rules described above unless there is any
legislation or similar mandating this.
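
The three CGMS rules above amount to a small state transition, sketched here in Python. The function name `cgms_for_copy` is hypothetical, and raising an error for the "not used" value 01b is a choice made for this sketch, not something the text prescribes.

```python
def cgms_for_copy(incoming: int):
    """Return the CGMS value to encode on the copy, or None if copying is not permitted."""
    if incoming == 0b00:
        return 0b00        # unrestricted: the copy stays unrestricted
    if incoming == 0b10:
        return 0b11        # one generation allowed: the copy is marked "no copying"
    if incoming == 0b11:
        return None        # no copying permitted
    raise ValueError("CGMS value 01b is not used")
```

A copy made from a "one generation" source is itself marked 11b, so it cannot be copied again.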

ISR: Input source of just previous recording
00b = Analogue input
01b = Digital input
10b = Reserved
11b = No information

CMP: The number of times of compression
00b = Compression once
01b = Compression twice
10b = Compression three times or more
11b = No information

SS: Source and recorded situation

00b = Scrambled source with audience restrictions and recorded without descrambling
01b = Scrambled source without audience restrictions and recorded without descrambling
10b = Source with audience restrictions or descrambled source with audience restrictions
11b = No information

If SS = 10b, then KEY pack should be recorded in the VAUX common optional area.
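
The four 2-bit fields of PC 1 can be unpacked as in the following Python sketch. The function name is hypothetical, and the field positions (CGMS in bits 7-6, ISR in bits 5-4, CMP in bits 3-2, SS in bits 1-0) are an assumption read from the MSB-to-LSB order of the pack layout in 9.2.

```python
CGMS = {0b00: "Copying permitted without restriction", 0b01: "Not used",
        0b10: "One generation of copying permitted", 0b11: "No copying permitted"}
ISR = {0b00: "Analogue input", 0b01: "Digital input",
       0b10: "Reserved", 0b11: "No information"}
CMP = {0b00: "Compression once", 0b01: "Compression twice",
       0b10: "Compression three times or more", 0b11: "No information"}


def decode_source_control_pc1(pc1: int) -> dict:
    """Unpack CGMS, ISR, CMP and SS from PC 1 of the SOURCE CONTROL pack."""
    ss = pc1 & 0x3
    return {"CGMS": CGMS[(pc1 >> 6) & 0x3],
            "ISR": ISR[(pc1 >> 4) & 0x3],
            "CMP": CMP[(pc1 >> 2) & 0x3],
            "SS": ss,
            # SS = 10b means a KEY pack should be present in the VAUX common optional area
            "key_pack_expected": ss == 0b10}
```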
REC ST: Recording start point

0 = Recording start point 

1 = Not recording start point 

The duration of recording start point should be the period of 30 frames (525-60 system) or 
25 frames (625-50 system). 

REC MODE: 

00b = Original 

01b = Reserved 

10b = Insert 

11b = Invalid recording 

where 

Original: Video and two audio blocks are recorded simultaneously.

Insert: Video area is recorded with the pre-recorded audio blocks remaining as they are.

Invalid recording: Recorded video data are not taken into account.
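
A minimal Python sketch of unpacking PC 2 of the SOURCE CONTROL pack follows. The function name is hypothetical, and the bit positions (REC ST in bit 7, REC MODE in bits 5-4, DISP in bits 2-0, with bits 6 and 3 fixed at 1) are an assumption read from the pack layout in 9.2.

```python
REC_MODE = {0b00: "Original", 0b01: "Reserved",
            0b10: "Insert", 0b11: "Invalid recording"}


def decode_source_control_pc2(pc2: int) -> dict:
    """Unpack REC ST, REC MODE and DISP from PC 2 of the SOURCE CONTROL pack."""
    return {
        # REC ST is active low: 0 = recording start point
        "recording_start_point": ((pc2 >> 7) & 0x1) == 0,
        "rec_mode": REC_MODE[(pc2 >> 4) & 0x3],
        "disp": pc2 & 0x7,
    }
```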

BCSYS: Broadcast system 

BCSYS indicates the type information of display format with DISP. 
00b = type 0 (refer to IEC 61880, EIA-608) 
01b = type 1 (refer to prETS 300 294) 
Others = Reserved 

DISP: Display select mode

BCSYS   DISP   Aspect ratio and format             Position
00      000    4:3 full format                     Not applicable
        001    16:9 letter box                     Centre
        010    16:9 full format (squeeze)          Not applicable
01      000    4:3 full format                     Not applicable
        001    14:9 letter box                     Centre
        010    14:9 letter box                     Top
        011    16:9 letter box                     Centre
        100    16:9 letter box                     Top
        101    >16:9 letter box                    Centre
        110    14:9 full format                    Centre
        111    16:9 full format (anamorphic)       Not applicable
10      000    [entry illegible in this copy]

[Any remaining rows of this table are illegible in this copy.]



61834-4 © IEC:1998 (E)



FF: Frame/Field flag 

FF indicates whether both fields are output in order or only one of them is output twice 
during one frame period. 

0 = Only one of two fields is output twice 

1 = Both fields are output in order 

FS: First/Second flag 

FS indicates a field which should be output during field 1 period. 

0 = Field 2 is output 

1 = Field 1 is output

FF   FS   Output field
1    1    Field 1 and field 2 are output in this order
1    0    Field 2 and field 1 are output in this order
0    1    Field 1 is output twice
0    0    Field 2 is output twice
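
The FF/FS table can be reduced to a two-step rule: FS selects the field output first, and FF decides whether the other field follows or the same field is repeated. A minimal Python sketch, with a hypothetical function name:

```python
def output_fields(ff: int, fs: int) -> tuple:
    """Return the two field numbers output during one frame period."""
    first = 1 if fs else 2                   # FS: 1 = field 1 first, 0 = field 2 first
    if ff:
        second = 2 if first == 1 else 1      # FF = 1: both fields are output in order
    else:
        second = first                       # FF = 0: the same field is output twice
    return first, second
```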



FC: Frame change flag 

FC indicates whether the picture of the current frame is the same picture of the immediate 
previous frame. 

0 = Same picture as the immediate previous frame 

1 = Different picture from the immediate previous frame 

IL: Interlace flag 

IL indicates whether the data of two fields which construct one frame are interlaced or non- 
interlaced. 

0 = Non-interlaced 

1 = Interlaced or unrecognized 

ST: Still-field picture flag 

ST indicates the time difference between the two fields within a frame. This flag shall have 
the same value for a duration of at least three frames. 

0 = The time difference between the fields is approximately 0 s. 

1 = The time difference between the fields is approximately 1,001/60 s (525-60 system)
or approximately 1/50 s (625-50 system).

SC: Still camera picture flag 

This flag is prepared for distinguishing a still camera picture. Still camera picture:
five consecutive frames of the same picture. For SC = 0, this flag may be used for displaying
a still camera picture by stopping tape travelling automatically.

0 = Still camera picture

1 = Not still camera picture
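
The six one-bit flags and BCSYS in PC 3 can be unpacked as in the Python sketch below. The function name is hypothetical, and the bit order (FF in bit 7 down to SC in bit 2, BCSYS in bits 1-0) is an assumption read from the MSB-to-LSB order of the pack layout in 9.2.

```python
def decode_source_control_pc3(pc3: int) -> dict:
    """Unpack FF, FS, FC, IL, ST, SC and BCSYS from PC 3 of the SOURCE CONTROL pack."""
    names = ("FF", "FS", "FC", "IL", "ST", "SC")
    # FF is assumed to be bit 7, FS bit 6, and so on down to SC at bit 2.
    flags = {name: (pc3 >> (7 - i)) & 0x1 for i, name in enumerate(names)}
    flags["BCSYS"] = pc3 & 0x3
    return flags
```

Note that SC is active low in the text above: a returned `SC` of 0 marks a still camera picture.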

GENRE CATEGORY: 

GENRE CATEGORY shows the category of the video source. 
The details are described in TIMER ACT DATE pack. 






Examples of how to use FF, FS, FC, IL and ST 

There are four types of input video signals: 

- interlaced motion picture: a normal standard TV signal;

- non-interlaced motion picture: a non-interlaced TV signal in a frame like a video game output;

- frame still picture: a still picture during a frame and the still picture is an interlaced TV signal;

- field still picture: a still picture during a field and the same still picture is repeated twice in a
frame.

If the type of an input signal is indefinite, interlaced motion picture should be selected. 

For original recording

[Figure: FF, FS, FC, IL and ST flag values recorded for frames a to e (fields a1, a2 to e1, e2), shown for
interlaced motion picture, non-interlaced motion picture, frame still picture and field still picture.
The flag values are not recoverable from this copy.]

NOTE - For frame still pictures and field still pictures, frames b, c and d are still
frames and have the same frame data.
For normal playback

[Figure: frames and fields reproduced from tape (a1, a2 to e1, e2), with the output order to the TV screen
(OT), the output order to the digital interface (OD), and the FF, FS, FC, IL and ST flag values, shown for
interlaced motion picture, non-interlaced motion picture, frame still picture and field still picture.
The field sequences and flag values are not recoverable from this copy.]

1) OT: output order to the TV screen

2) OD: output order to the digital interface

NOTE - For frame still pictures and field still pictures, frames b, c and d are still
frames and have the same frame data.
[Figure: frames reproduced from tape with fields repeated, with the output order to the TV screen (OT), the
output order to the digital interface (OD), and the FF, FS, FC, IL and ST flag values, shown for interlaced
motion picture (field slow), interlaced motion picture (frame slow), non-interlaced motion picture (field
slow), frame still picture and field still picture. The field sequences and flag values are not recoverable
from this copy.]

1) OT: output order to the TV screen.

2) OD: output order to the digital interface.

NOTE - For frame still pictures and field still pictures, frames b, c and d are still frames
and have the same frame data.



For slow playback (X -1/3) 
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Reproducing frames from tape 




[Table: frames reproduced from tape in the order e, d, d, d, c, the corresponding fields, and the resulting output orders (OT, OD), with the FF, FS, FC, IL and ST flag values, for each picture type: interlaced motion picture (field slow), interlaced motion picture (frame slow), non-interlaced motion picture (field slow), frame still picture and field still picture. The individual cell values are not recoverable from the scan.]





1) OT output order to the TV screen 

2) OD output order to the digital interface 



NOTE - For frame still pictures and field still pictures, frames b, c and d are still frames 
and have the same frame data. 
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For still playback 
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For still playback (concluded) 



[Table: for still playback, the fields reproduced from tape and both output orders (OT, OD) are a1, a2, b1, b2, b1, b2, b1, b2, b1, b2, shown for frame still picture and field still picture together with their FF, FS, FC, IL and ST flag values. The flag values are not reliably recoverable from the scan.]


1) OT output order to the TV screen 

2) OD output order to the digital interface 
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For fast playback 
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VAUX 2 

9.3 REC DATE (Recording date) 



       MSB                                   LSB
PC 0 |  0 |  1 |  1 |  0 |  0 |  0 |  1 |  0
PC 1 | DS | TM | TENS of TIME ZONE | UNITS of TIME ZONE
PC 2 |  1 |  1 | TENS of DAY       | UNITS of DAY
PC 3 | WEEK    | TNMN              | UNITS of MONTH
PC 4 | TENS of YEAR                | UNITS of YEAR



This pack should be recorded in the VAUX main area. The date when video data are recorded 
is stored in this pack. 

DS: Daylight saving time 

0 = Daylight saving time 

1 = Normal 

TM: Thirty minutes flag

Thirty minutes unit of the time differential from GMT
0 = 30 min
1 = 0 min

TIME ZONE:

00 to 23
3Fh = No information

Example
For Tokyo

TIME ZONE = 001001b

PC1 = 11001001b GMT plus 9:00

For New York with daylight saving time
TIME ZONE = 011001b

PC1 = 01011001b GMT plus 19:00

For New Delhi, where a 30 min unit of the time differential from GMT is adopted
TIME ZONE = 000101b

PC1 = 10000101b GMT plus 5:30

where GMT: Greenwich Mean Time 
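
The PC1 examples above can be checked with a short sketch. It assumes the bit positions implied by the examples (DS in bit 7, TM in bit 6, the tens of TIME ZONE in bits 5-4 and the units in bits 3-0); the function name and the dictionary keys are illustrative only and are not part of the specification.

```python
def decode_rec_date_pc1(pc1: int) -> dict:
    """Split the PC1 byte of a REC DATE pack into DS, TM and TIME ZONE."""
    tens = (pc1 >> 4) & 0x03          # TENS of TIME ZONE (2 bits, BCD)
    units = pc1 & 0x0F                # UNITS of TIME ZONE (4 bits, BCD)
    no_info = (pc1 & 0x3F) == 0x3F    # 3Fh = No information
    return {
        # DS: 0 = daylight saving time, 1 = normal
        "daylight_saving": ((pc1 >> 7) & 1) == 0,
        # TM: 0 = 30 min, 1 = 0 min
        "half_hour_offset": ((pc1 >> 6) & 1) == 0,
        "time_zone": None if no_info else tens * 10 + units,
    }


if __name__ == "__main__":
    examples = [("Tokyo", 0b11001001),
                ("New York", 0b01011001),
                ("New Delhi", 0b10000101)]
    for city, pc1 in examples:
        print(city, decode_rec_date_pc1(pc1))
```

With these assumptions the three example bytes decode to time zones 9, 19 and 5, with the daylight saving flag set only for New York and the 30 min flag set only for New Delhi.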
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DAY: 



01 to 31 



3Fh = No information 



WEEK: 

0 = Sunday 4 = Thursday 

1 = Monday 5 = Friday 

2 = Tuesday 6 = Saturday 

3 = Wednesday 7 = No information 

MONTH: 

01 to 12 = January to December
1Fh = No information

TNMN: Tens of month

YEAR: Last two figures of year

00 to 99
FFh = No information
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VAUX 3 

9.4 REC TIME 

This pack should be recorded in the VAUX main area. 

The time when video data are recorded is stored based on the SMPTE/EBU time code format. 
For not recording VAUX BINARY pack 





       MSB                              LSB
PC 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1
PC 1 | 1 | 1 | TENS of FRAMES  | UNITS of FRAMES
PC 2 | 1 | TENS of SECONDS     | UNITS of SECONDS
PC 3 | 1 | TENS of MINUTES     | UNITS of MINUTES
PC 4 | 1 | 1 | TENS of HOURS   | UNITS of HOURS



Consumer digital VCR shall adopt the drop frame sequence. 
If FRAME is not used, FRAME shall be 3Fh. 
For recording VAUX BINARY pack 



       MSB                                LSB
PC 0 | 0  | 1  | 1 | 0 | 0 | 0 | 1 | 1
PC 1 | S2 | S1 | TENS of FRAMES  | UNITS of FRAMES
PC 2 | S3 | TENS of SECONDS      | UNITS of SECONDS
PC 3 | S4 | TENS of MINUTES      | UNITS of MINUTES
PC 4 | S6 | S5 | TENS of HOURS   | UNITS of HOURS



S1 to S6 flags shall be recorded based on SMPTE/EBU format. 



Bit number | S1 | S2 | S3 | S4 | S5 | S6
VITC       | 14 | 15 | 35 | 55 | 74 | 75
LTC        | 10 | 11 | 27 | 43 | 58 | 59




VITC: vertical interval time code 
LTC: linear time code 
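
As a rough illustration of the REC TIME layouts above, the sketch below decodes the BCD fields of PC1 to PC4. It assumes the field widths shown in the tables (two bits for the tens of frames and hours, three bits for the tens of seconds and minutes, four bits for every units field) and simply masks off the upper bits, which hold either fixed 1s or the S1 to S6 flags. The function and key names are hypothetical.

```python
def decode_rec_time(pc1: int, pc2: int, pc3: int, pc4: int) -> dict:
    """Decode the BCD time fields of a REC TIME pack (PC1..PC4)."""

    def bcd(byte: int, tens_bits: int) -> int:
        # The tens digit sits directly above the 4-bit units digit.
        tens = (byte >> 4) & ((1 << tens_bits) - 1)
        units = byte & 0x0F
        return tens * 10 + units

    # "If FRAME is not used, FRAME shall be 3Fh."
    frames = None if (pc1 & 0x3F) == 0x3F else bcd(pc1, 2)
    return {
        "hours": bcd(pc4, 2),
        "minutes": bcd(pc3, 3),
        "seconds": bcd(pc2, 3),
        "frames": frames,
    }


if __name__ == "__main__":
    # 12:34:56, frame 07, with the flag bits set to 1 (no BINARY pack).
    print(decode_rec_time(0xC7, 0xD6, 0xB4, 0xD2))
```

Because the flag bits are masked out, the same function can be applied whether or not the VAUX BINARY pack is recorded.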
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9.5 BINARY GROUP 
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This pack may be recorded in the VAUX main area. 

If this pack is used, S1 to S6 flags in VAUX REC TIME pack shall be set based on the
SMPTE/EBU time code format.

If this pack is not used, NO INFO pack shall be recorded. 



9.6 CLOSED CAPTION 
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This pack should be recorded in the VAUX main area. 

Closed caption data should be stored in the VAUX CLOSED CAPTION pack. [...]
shall be stored from the next bit of the start bits as an LSB. If the data which concern VAUX SOURCE, VAUX
[...] VAUX CLOSED CAPTION packs have been recorded on tape, closed caption signals
should be reconstructed and added to line 21 in each field of the vertical blanking period.

More details are given in 9.5 of Part 2. 
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9.7 TR (Transparent) 
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This pack should be recorded in the VAUX main area. 

In addition to the VAUX CLOSED CAPTION pack, the VAUX TR pack is prepared for preserving digital
data such as Video ID, WSS (wide screen signalling) and EDTV-2 ID without change. If these [...]
added in the appropriate lines of the vertical blanking period.
More details are given in 9.5 of Part 2. 

DATATYPE: 

0 = Video ID 

1 = WSS 

2 = EDTV-2 ID in 22 line 

3 = EDTV-2 ID in 285 line 
Fh = No information 
Others = Reserved 

For recording Video ID data 

Video ID data of one horizontal line consists of 20 bits. The data shall be stored from the side
of horizontal sync as an LSB. All 20 bits of data shall be stored in the VAUX TR pack.

For recording WSS data

WSS data of one horizontal line consists of 14 bits. The data shall be stored from the next bit of
the start bits as an LSB. All 14 bits of data shall be stored in the VAUX TR pack.

For recording EDTV-2 ID data

EDTV-2 ID data of one horizontal line consists of 27 bits. The data shall be stored from the
side of horizontal sync as an LSB. 24 bits of the data, excluding the last 3 bits for the discriminator,
shall be stored in the VAUX TR pack.
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VAUX 7 



9.8 TELETEXT 
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PC 1 



PC 2 



PC 3 



-TELETEXT DATA- 




This pack may be recorded in the VAUX common optional area.

Step 1: Gathering teletext data in one horizontal line

[...] as an LSB to the end of teletext signal in order.

Step 2: Packing teletext data in TELETEXT packs

The Teletext ID [...] last TELETEXT pack, the [...] shall be filled.

Step 3: [...] data in VAUX common optional area

The queue of teletext recording packs consists of a VAUX TEXT HEADER pack, [...] NO INFO pack, if needed, and TELETEXT
packs. [...] of the VAUX TEXT HEADER pack shall be set to Ah. [...] one video frame. In the final
TELETEXT pack in one video frame, the [...] recorded.



Teletext ID:
Teletext ID consists of System ID, Odd / Even and Line ID.

System ID | O/E | Line ID
b7 b6     | b5  | b4 b3 b2 b1 b0

00b = [...] (... Recommendation 653)

01b = NABTS teletext system (teletext type C in ITU-R Recommendation 653)

10b = Reserved

11b = UK teletext system (EBU SPB 492 - December 1992, teletext type B in ITU-R
Recommendation 653)
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O/E: Odd / Even 

0 = Odd field or first field 

1 = Even field or second field 

Line ID: Line number ID 
For 525-60 system 

0 to 0Ch = Actual line number
0Dh to 1Eh = Reserved

1Fh = Terminate code

For O/E = 0, Actual line number = 10 + Line ID
For O/E = 1, Actual line number = 272 + Line ID

For 625-50 system

0 to 11h = Actual line number

12h to 1Eh = Reserved

1Fh = Terminate code

For O/E = 0, Actual line number = 6 + Line ID
For O/E = 1, Actual line number = 318 + Line ID
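
The Line ID rules above can be sketched as a small lookup. The split of the Teletext ID byte used here (System ID in b7-b6, O/E in b5, Line ID in b4-b0) is an assumption inferred from the 2-bit System ID values and the 1Fh terminate code; the names `LINE_RULES` and `decode_teletext_id` are illustrative only.

```python
# system -> (largest valid Line ID, base line for O/E = 0, base line for O/E = 1)
LINE_RULES = {
    "525-60": (0x0C, 10, 272),
    "625-50": (0x11, 6, 318),
}
TERMINATE_CODE = 0x1F


def decode_teletext_id(teletext_id: int, system: str):
    """Return (system_id, odd_even, line), where line is an actual line
    number, "terminate" or "reserved"."""
    system_id = (teletext_id >> 6) & 0x03   # assumed b7-b6
    odd_even = (teletext_id >> 5) & 0x01    # assumed b5
    line_id = teletext_id & 0x1F            # assumed b4-b0
    max_id, first_base, second_base = LINE_RULES[system]
    if line_id == TERMINATE_CODE:
        line = "terminate"
    elif line_id <= max_id:
        line = (first_base if odd_even == 0 else second_base) + line_id
    else:
        line = "reserved"
    return system_id, odd_even, line


if __name__ == "__main__":
    print(decode_teletext_id(0b11100010, "625-50"))
    print(decode_teletext_id(0b01001100, "525-60"))
```

For instance, under these assumptions a Line ID of 2 in the second field of a 625-50 system maps to line 320, and a Line ID of 0Ch in the first field of a 525-60 system maps to line 22.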

Procedure for recording teletext data 



[Figure: in the VAUX common optional area, each queue (first, second, ..., last) begins with a VAUX TEXT HEADER pack carrying the TEXT TYPE and is followed by TELETEXT packs, each holding a Teletext ID and the teletext data of one horizontal line; the last queue ends with the terminate code.]
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9.9 TEXT HEADER 
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1 


1 
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I 



This pack may be recorded or written in the common optional areas. 

TDP: Total number of text data (see Figure 55 of Part 2) 

For tape, total number of TEXT packs which follow this pack 
For MIC, total number of text data bytes which follow PC3

TEXT TYPE: 



0 = Name 

1 = Memo 

2 = Station 

3 = Model 

6 = Operator



7 = Subtitle 

8 = Outline 

9 = Full screen 

Ah = Teletext header 
Ch = One byte coded font 



Dh = Two byte coded font 
Eh = Graphic 
Fh = No information 
Others = Reserved 



OPN: Option number 

OPN is the option number of UK teletext. More details are given in teletext specification 

(EBU SPB 492 - December 1992). 

If OPN is not used, OPN shall be 111b.

TEXT CODE: 

TEXT CODE designates the character set. The details are described in CONTROL TEXT 
HEADER pack. 
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9.10 TEXT 
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This pack may be recorded in the common optional areas on tape. 

This pack contains font data, graphic data, text data according to TEXT TYPE designated in 
VAUX TEXT HEADER pack. 
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9.11 VAUX START 



VAUX 10 
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1 0 
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10 10 

1 1 _l 






DF 


TENS of 


UNITS of 


PC 1 


1 
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FRAMES 
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UNITS of 




SECONDS 


SECONDS 


PC 3 




TENS of 


UNITS of 




MINUTES 


MINUTES 


PC 4 




TENS of 


UNITS of 




! HOyRS , 


, HOURS , 



This pack may be recorded or written in the common optional areas except for the AAUX 

optional area. 

This pack shows the tape position of starting to insert video data using title time code. 

DF: Drop frame flag 

0 = Drop frame mode 

1 = Non drop frame mode 

Drop frame sequence shall be based on SMPTE/EBU format. 
For consumer digital VCR, DF shall be 0. 

FRAMES: 

For 525-60 or 1125-60 system 

00 to 29 
For 625-50 or 1250-50 system 

00 to 24 

SECONDS: 

00 to 59 

MINUTES: 

00 to 59 



HOURS: 

00 to 23 
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9.12 VAUX START 
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This pack may be recorded or written in the common optional areas except for the AAUX 

optional area. 

This pack shows the tape position of starting to insert video data using absolute track number. 

ABSOLUTE TRACK NO.: 

Absolute track number which shows the tape position of starting to insert video data 

TT: Temporary true 

This flag is valid only for MIC. 

0 = This event data in MIC does not always exist on tape

1 = This event data in MIC certainly exists on tape
For subcode, AAUX and VAUX, TT shall be 1.

TEXT:

This flag is valid only for MIC.

0 = Text information exists

1 = No text information exists

For subcode, AAUX and VAUX, TEXT shall be 1.

GENRE CATEGORY: 

GENRE CATEGORY shows the category of the inserted video source. 
The details are described in TIMER ACT DATE pack. 
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9.13 MARINE/MOUNTAIN 
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TENS of 
TEMPERATURE 


UNITS of 
TEMPERATUF 


IE 


PC 3 


UNITS of 
PRESSURE 


THPR 


1 0 J NP 


HDRT 


PC 4 


HUNDREDS of 
PRESSURE, 


TENS Of 
,PRESSURE, 



This pack may be recorded or written in the common optional areas. 

This pack contains the temperature and pressure data of the location where the recording was made. 

CF: Centigrade/Fahrenheit 

0 = Fahrenheit 

1 = Centigrade 

CATEGO: Category code 

0 = Marine 

1 = Mountain 
Others = Reserved 

NP: Negative/positive 

NP shows the positive and negative sign of the temperature data. 

0 = Negative 

1 = Positive 

PRESSURE: 

0 000 hPa to 1 999 hPa 

THPR: Thousands of pressure

TEMPERATURE: 

000,0 to 199,9 



HDRT: Hundreds of temperature 
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VAUX12 

9.13 MARINE/MOUNTAIN (continued) 
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This pack may be recorded or written in the common optional areas. 

This pack contains the temperature and pressure data of the location where the recording
was made.

CF: Centigrade/Fahrenheit 

0 = Fahrenheit 

1 = Centigrade 

CATEGO: Category code 

0 = Marine 

1 = Mountain 
Others = Reserved 

NP: Negative/positive 

NP shows the positive and negative sign of the temperature data. 
0 = Negative 
1 = Positive

ATM PRESSURE: 

000,0 atm to 199,9 atm 
where atm = hPa / 1 013,25 

HDPR: Hundreds of atm pressure 

TEMPERATURE: 

000,0 to 199,9 

HDRT: Hundreds of temperature 
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9.13 MARINE/MOUNTAIN (concluded) 



PC 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0
PC 1 | 1/10 of HEIGHT       | FM | CATEGO | 1
PC 2 | TENS of HEIGHT       | UNITS of HEIGHT
PC 3 | THOUSANDS of HEIGHT  | HUNDREDS of HEIGHT
PC 4 | 1 | 1 | 1 | NP       | TEN THOUSANDS of HEIGHT



This pack may be recorded or written in the common optional areas. 

This pack contains the height and depth data of the location where the recording was made. 

FM: Feet/meter 

0 = Feet 

1 = Meter 

CATEGO: Category code 

0 = Marine 

1 = Mountain 
Others = Reserved 

NP: Negative/positive 

NP shows the positive and negative sign of the height data. 

0 = Negative 

1 = Positive 



HEIGHT: 

00 000,0 to 99 999,9 
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VAUX 13 

9.14 LONGITUDE/LATITUDE 
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This pack may be recorded or written in the common optional areas. 

This pack contains the longitude data of the location where the recording was made. 

SECOND: 

00 to 59 

MINUTE: 

00 to 59 

DEGREE: 

00 to 180 

HDRD: Hundreds of degrees 

Longitude data has a valid range of 0° 00'00" to 180° 00'00".

EW: East/West 

0 = East 

1 = West 
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9.14 LONGITUDE/LATITUDE (concluded) 
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This pack may be recorded or written in the common optional areas. 

This pack contains the latitude data of the location where the recording was made. 



SECOND: 

00 to 59 

MINUTE: 

00 to 59 

DEGREE: 

00 to 90 



Latitude data has a valid range of 0° 00'00" to 90° 00'00".

NS: North/South 

0 = North 

1 = South 
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9.15 VAUX END 
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This pack may be recorded or written in the common optional areas except for the AAUX 
optional area. 

This pack shows the tape position of ending to insert video data using title time code. 

DF: Drop frame flag 

0 = Drop frame mode 

1 = Non drop frame mode 

Drop frame sequence shall be based on SMPTE/EBU format. 
For consumer digital VCR, DF shall be 0. 

FRAMES: 

For 525-60 or 1125-60 system

00 to 29 
For 625-50 or 1250-50 system 

00 to 24 

SECONDS: 

00 to 59 

MINUTES: 

00 to 59 

HOURS: 

00 to 23 
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9.16 VAUX END 
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This pack may be recorded or written in the common optional areas except for the AAUX 

optional area. 

This pack shows the tape position of ending to insert video data using absolute track number. 

ABSOLUTE TRACK NO.: 

Absolute track number which shows the end tape position of video insert 

BF: Blank flag 

0 = Discontinuity exists before this absolute track number. 

1 = Discontinuity does not exist before this absolute track number. 

TNT: Total number of text events 
TNT is valid only for MIC. 

TNT shows the total number of text events related to this VAUX event. 

0 to 6
7 = No information

For subcode, AAUX and VAUX, TNT shall be 111b.
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Status of this Memo 

This document is an Internet draft. Internet drafts are working documents of the Internet 
Engineering Task Force (IETF), its areas and its working groups. Note that other groups may also 
distribute working information as Internet drafts. 

Internet Drafts are draft documents valid for a maximum of six months and can be updated, 
replaced or obsoleted by other documents at any time. It is inappropriate to use Internet drafts as 
reference material or to cite them as other than as "work in progress". 

To view the entire list of current Internet-Drafts, please check the "1id-abstracts.txt" listing
contained in the Internet-Drafts Shadow Directories on ftp.is.co.za (Africa), ftp.nordu.net
(Northern Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific Rim), ftp.ietf.org (US
East Coast), or ftp.isi.edu (US West Coast).

Distribution of this document is unlimited. Please send comments to the mailing list at
<www-webdav-dasl@w3.org>, which may be joined by sending a message with subject "subscribe" to
<www-webdav-dasl-request@w3.org>.

Discussions of the list are archived at
<URL:http://www.w3.org/pub/WWW/Archives/Public/www-webdav-dasl>.



DAV Searching and Locating Protocol

http://www.webdav.org/dasl/protocol/draft-dasl-protocol-00.html

00/03/02

Abstract

This document specifies a set of methods, headers, and content-types composing DASL, an
application of the HTTP/1.1 protocol to efficiently search for DAV resources based upon a set of
client-supplied criteria.

1. Introduction

1.1 DASL

This document defines DAV Searching & Locating (DASL), an application of HTTP/1.1 forming
a lightweight search protocol that transports queries and result sets and allows clients to make use of
server-side search facilities. [DASLREQ] describes the motivation for DASL.

DASL will minimize the complexity of clients so as to facilitate widespread deployment of
applications capable of utilizing the DASL search mechanisms.

DASL consists of:

• the SEARCH method, 

• the DASL response header, 

• the DAV:searchrequest XML element, 

• the DAV:queryschema property, 

• the DAV:basicsearch XML element and query grammar, and 

• the DAV:basicsearchschema XML element. 

1.2 Relationship to DAV 

DASL relies on the resource and property model defined by [WebDAV]. DASL does not alter this 
model. Instead, DASL allows clients to access DAV-modeled resources through server-side 
search. 

1.3 Terms 

This draft uses the terms defined in [RFC2068], [WebDAV], and [DASLREQ]. 

1.4 Notational Conventions 

The augmented BNF used by this document to describe protocol elements is exactly the same as 
the one described in Section 2.1 of [RFC2068]. Because this augmented BNF uses the basic 
production rules provided in Section 2.2 of [RFC2068], those rules apply to this document as 
well. 

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", 
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be 
interpreted as described in [RFC2119].

1.5 An Overview of DASL at Work 

One can express the basic usage of DASL in the following steps: 

The client constructs a query using the DAV:basicsearch grammar.
The client invokes the SEARCH method on a resource that will perform the search (the 
search arbiter) and includes a text/xml request entity that contains the query. 
The search arbiter performs the query. 

The search arbiter sends the results of the query back to the client in the response. The 
server MUST send a text/xml entity that matches the [WebDAV] PROPFIND response. 

2. The SEARCH Method 

2.1 Overview 

The client invokes the SEARCH method to initiate a server-side search. The body of the request 
defines the query. The server MUST emit a text/xml entity matching the [WebDAV] PROPFIND
response. 

The SEARCH method plays the role of transport mechanism for the query and the result set. It 
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does not define the semantics of the query. The type of the query defines the semantics. 

2.2 The Request 

The client invokes the SEARCH method on the resource named by the Request-URI.

2.2.1 The Request-URI 

The Request-URI identifies the search arbiter. 

The SEARCH method defines no relationship between the arbiter and the scope of the search;
rather, the particular query grammar used in the query defines the relationship. For example, the
FOO query grammar may force the request-URI to correspond exactly to the search scope.

2.2.2 The Request Body 

The server MUST process a text/xml or application/xml request body, and MAY process request 
bodies in other formats. See [RFC 2376] for guidance on packaging XML in requests. 

If the client sends a text/xml or application/xml body, it MUST include the DAV:searchrequest
XML element. The DAV:searchrequest XML element identifies the query grammar, defines the
criteria, the result record, and any other details needed to perform the search.

2.3 The DAV:searchrequest XML Element

<!ELEMENT searchrequest ANY >

The DAV:searchrequest XML element contains a single XML element that defines the query.
The name of the query element defines the type of the query. The value of that element defines the
query itself.

2.4 The Successful 207 (Multistatus) Response 

If the server returns 207 (Multistatus), then the search proceeded successfully and the response 
MUST match that of a PROPFIND. 

There MUST be one DAV:response for each resource that matched the search criteria. For each
such response, the DAV:href element contains the URI of the resource, and the response MUST
include a DAV:propstat element.

In addition, the server MAY include DAV:response items in the reply where the DAV:href
element contains a URI that is not a matching resource, e.g. that of a scope or the query arbiter.
Each such response item MUST NOT contain a DAV:propstat element, and MUST contain a
DAV:status. It SHOULD contain a DAV:responsedescription.

2.4.1 Extending the PROPFIND Response

A response MAY include more information than PROPFIND defines so long as the extra 
information does not invalidate the PROPFIND response. Query grammars SHOULD define how 
the response matches the PROPFIND response. 
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2.4.1 Example: A Simple Request and Response 

This example demonstrates the request and response framework. The following XML document 
shows a simple (hypothetical) natural language query. The name of the query element is 
FOO:natural-language-query, thus the type of the query is FOO:natural-language-query. The
actual query is "Find the locations of good Thai restaurants in Los Angeles". For this hypothetical 
query, the arbiter returns two properties for each selected resource. 

SEARCH / HTTP/1.1
Host: ryu.com
Content-Type: text/xml
Connection: Close
Content-Length: 243

<?xml version="1.0"?>
<D:searchrequest xmlns:D="DAV:" xmlns:F="FOO:">
  <F:natural-language-query>
    Find the locations of good Thai restaurants in Los Angeles
  </F:natural-language-query>
</D:searchrequest>

>> Response

HTTP/1.1 207 Multi-Status
Content-Type: text/xml
Content-Length: 333

<?xml version="1.0"?>
<D:multistatus xmlns:D="DAV:" xmlns:F="FOO:"
    xmlns:R="http://ryu.com/propschema">
  <D:response>
    <D:href>http://siamiam.com/</D:href>
    <D:propstat>
      <D:prop>
        <R:location>259 W. Hollywood</R:location>
        <R:rating><R:stars>4</R:stars></R:rating>
      </D:prop>
    </D:propstat>
  </D:response>
</D:multistatus>
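
To make the request and response framework concrete, the following is a minimal client-side sketch, not part of the draft. It builds the DAV:searchrequest body for the hypothetical FOO:natural-language-query grammar and reads the properties out of a 207 body using Python's standard `xml.etree.ElementTree`. The function names are illustrative, and no network call is made; sending the body with the SEARCH method is left to whatever HTTP library the client uses.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

NS = {"D": "DAV:"}


def build_search_body(query: str) -> str:
    """Wrap a natural-language query in a DAV:searchrequest element."""
    return (
        '<?xml version="1.0"?>\n'
        '<D:searchrequest xmlns:D="DAV:" xmlns:F="FOO:">\n'
        f"  <F:natural-language-query>{escape(query)}</F:natural-language-query>\n"
        "</D:searchrequest>\n"
    )


def parse_multistatus(xml_text: str) -> dict:
    """Map each DAV:href to a dict of {qualified property name: text}."""
    results = {}
    for resp in ET.fromstring(xml_text).findall("D:response", NS):
        href = resp.findtext("D:href", default="", namespaces=NS).strip()
        props = {}
        for prop in resp.findall("D:propstat/D:prop", NS):
            for child in prop:
                # itertext() also collects text of nested elements such as R:stars
                props[child.tag] = "".join(child.itertext()).strip()
        results[href] = props
    return results


SAMPLE_RESPONSE = """<?xml version="1.0"?>
<D:multistatus xmlns:D="DAV:" xmlns:R="http://ryu.com/propschema">
  <D:response>
    <D:href>http://siamiam.com/</D:href>
    <D:propstat>
      <D:prop>
        <R:location>259 W. Hollywood</R:location>
        <R:rating><R:stars>4</R:stars></R:rating>
      </D:prop>
    </D:propstat>
  </D:response>
</D:multistatus>"""

if __name__ == "__main__":
    print(build_search_body("Find the locations of good Thai restaurants in Los Angeles"))
    print(parse_multistatus(SAMPLE_RESPONSE))
```

Property names come back in ElementTree's `{namespace}local` form, so a client that cares about prefixes would need to map them itself.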

2.5 Unsuccessful Responses 

If an error occurred that prevented execution of the query, the server MUST indicate the failure 
with the appropriate status code and SHOULD include a DAV:multistatus element to point out
errors associated with scopes. 

400 Bad Request. The query could not be executed. The request may be malformed (not valid 
XML for example). Additionally, this can be used for invalid scopes and search redirections. 

422 Unprocessable entity. The query could not be executed. If a text/xml request entity was 
provided, then it may have been valid (well-formed) but may have contained an unsupported or 
unimplemented query operator. 

507 (Insufficient Storage). The query produced more results than the server was willing to 
transmit. Partial results have been transmitted. The server MUST send a body that matches that 
for 207, except that there MAY exist resources that matched the search criteria for which no 
corresponding DAV:response exists in the reply.
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2.5.1 Example: Result Set Truncation 

A server MAY limit the number of resources in a reply, for example to limit the amount of 
resources expended in processing a query. If it does so, the reply MUST use status code 507. It 
SHOULD include the partial results. 

When a result set is truncated, there may be many more resources that satisfy the search criteria 
but that were not examined. 

If partial results are included and the client requested an ordered result set in the original request, 
then any partial results that are returned MUST be ordered as the client directed. 

Note that the partial results returned MAY be any subset of the result set that would have satisfied 
the original query. 

SEARCH / HTTP/1.1
Host: gdr.com
Content-Type: text/xml
Connection: Close
Content-Length: xxxxx

<?xml version="1.0"?>
<D:searchrequest xmlns:D="DAV:">
  <D:basicsearch>
    ... the query goes here ...
  </D:basicsearch>
</D:searchrequest>

>> Response

HTTP/1.1 507 Insufficient Storage
Content-Type: text/xml
Content-Length: 738

<?xml version="1.0"?>
<D:multistatus xmlns:D="DAV:">
  <D:response>
    <D:href>http://www.gdr.com/sounds/unbrokenchain.au</D:href>
    <D:propstat>
      <D:prop/>
      <D:status>HTTP/1.1 200 OK</D:status>
    </D:propstat>
  </D:response>
  <D:response>
    <D:href>http://tech.mit.edu/archive96/photos/Lesh1.jpg</D:href>
    <D:propstat>
      <D:prop/>
      <D:status>HTTP/1.1 200 OK</D:status>
    </D:propstat>
  </D:response>
  <D:response>
    <D:href>http://gdr.com</D:href>
    <D:status>HTTP/1.1 507 Insufficient Storage</D:status>
    <D:responsedescription>
      Only first two matching records were returned
    </D:responsedescription>
  </D:response>
</D:multistatus>
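
One way a client might handle such a reply is sketched below. It relies on the rule from section 2.4 that matching resources carry a DAV:propstat while scope or arbiter notices carry only a DAV:status, and treats the result set as truncated when the HTTP status or any notice reports 507. The function names and the sample body are illustrative assumptions, not part of the draft.

```python
import xml.etree.ElementTree as ET

NS = {"D": "DAV:"}


def split_search_reply(xml_text: str):
    """Return (matches, notices): hrefs with a propstat, and
    (href, status) pairs for responses without one."""
    matches, notices = [], []
    for resp in ET.fromstring(xml_text).findall("D:response", NS):
        href = resp.findtext("D:href", default="", namespaces=NS).strip()
        if resp.find("D:propstat", NS) is not None:
            matches.append(href)
        else:
            status = resp.findtext("D:status", default="", namespaces=NS).strip()
            notices.append((href, status))
    return matches, notices


def is_truncated(http_status: int, notices) -> bool:
    """True if the reply as a whole, or any notice in it, reports 507."""
    return http_status == 507 or any(
        status.split()[1:2] == ["507"] for _, status in notices
    )


SAMPLE = """<?xml version="1.0"?>
<D:multistatus xmlns:D="DAV:">
  <D:response>
    <D:href>http://www.gdr.com/sounds/unbrokenchain.au</D:href>
    <D:propstat><D:prop/><D:status>HTTP/1.1 200 OK</D:status></D:propstat>
  </D:response>
  <D:response>
    <D:href>http://gdr.com</D:href>
    <D:status>HTTP/1.1 507 Insufficient Storage</D:status>
  </D:response>
</D:multistatus>"""

if __name__ == "__main__":
    found, extra = split_search_reply(SAMPLE)
    print(found, extra, is_truncated(507, extra))
```

A client that sees a truncated reply can still use the partial matches, bearing in mind that they may be any subset of the full result set.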



2.6 Invalid Scopes & Search Redirections 
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2.6.1 Indicating an Invalid Scope 

A client may submit a scope that the arbiter may be unable to query. The inability to query may 
be due to network failure, administrative policy, security, etc. This raises the condition described 
as an "invalid scope". 

To indicate an invalid scope, the server MUST respond with a 400 (Bad Request). 

The response includes a text/xml body with a DAV:multistatus element. Each DAV:response in
the DAV:multistatus identifies a scope. To indicate that this scope is the source of the error, the
server MUST include the DAV:scopeerror element.

2.6.2 Example of an Invalid Scope 

HTTP/1.1 400 Bad-Request
Content-Type: text/xml
Content-Length: xxxxx

<?xml version="1.0"?>
<d:multistatus xmlns:d="DAV:">
  <d:response>
    <d:href>http://www.foo.com/X</d:href>
    <d:status>HTTP/1.1 404 Object Not Found</d:status>
    <d:scopeerror/>
  </d:response>
</d:multistatus>

2.6.3 Redirections 

As described above, a server can indicate only that the scope is invalid. Some search arbiters may 
be able to indicate that other search arbiters exist for that scope. 

In this case, the server MUST: 

(1) include the DAV:scopeerror element

(2) include the DAV:status element for that scope. The value of this element MUST be a 303
(See Other) response.

(3) include the DAV:redirectarbiter element for each arbiter the client should use for the
redirect. The value of this element is the URI of the arbiter to use. Multiple
DAV:redirectarbiter elements are allowed.

2.6.4 Example of a Search Redirection 

HTTP/1.1 400 Bad-Request
Content-Type: text/xml
Content-Length: xxxxx

<?xml version="1.0"?>
<?xml:namespace ns="DAV:" prefix="d" ?>
<d:multistatus>
  <d:response>
    <d:href>http://www.foo.com/X</d:href>
    <d:status>HTTP/1.1 303 See Other</d:status>
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<d : scopeerror /> 

    <d:redirectarbiter>http://bar.com/B</d:redirectarbiter>
    <d:redirectarbiter>http://baz.com/B</d:redirectarbiter>
  </d:response>
</d:multistatus>
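
A client could collect the suggested arbiters from such a reply roughly as follows. This sketch assumes the body declares the DAV: namespace with an ordinary `xmlns:d="DAV:"` attribute, since a standard XML parser does not interpret the `<?xml:namespace ...?>` processing instruction used in the example above; the function name is illustrative.

```python
import xml.etree.ElementTree as ET

NS = {"d": "DAV:"}


def redirect_targets(xml_text: str) -> dict:
    """Map each redirected scope href to the list of arbiter URIs offered."""
    targets = {}
    for resp in ET.fromstring(xml_text).findall("d:response", NS):
        if resp.find("d:scopeerror", NS) is None:
            continue  # not flagged as the source of the error
        status = resp.findtext("d:status", default="", namespaces=NS).split()
        if status[1:2] != ["303"]:
            continue  # invalid scope without a redirection
        href = resp.findtext("d:href", default="", namespaces=NS).strip()
        targets[href] = [
            el.text.strip()
            for el in resp.findall("d:redirectarbiter", NS)
            if el.text
        ]
    return targets


SAMPLE = """<?xml version="1.0"?>
<d:multistatus xmlns:d="DAV:">
  <d:response>
    <d:href>http://www.foo.com/X</d:href>
    <d:status>HTTP/1.1 303 See Other</d:status>
    <d:scopeerror/>
    <d:redirectarbiter>http://bar.com/B</d:redirectarbiter>
    <d:redirectarbiter>http://baz.com/B</d:redirectarbiter>
  </d:response>
</d:multistatus>"""

if __name__ == "__main__":
    print(redirect_targets(SAMPLE))
```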

2.6.5 Syntax for DAV:scopeerror

<!ELEMENT scopeerror EMPTY>

2.6.6 Syntax for DAV:redirectarbiter

<!ELEMENT redirectarbiter (#PCDATA)>



The contents must be a URL. 



3. Discovery of Supported Query Grammars 

Servers MUST support discovery of the query grammars supported by a search arbiter resource. 

Clients can determine which query grammars are supported by an arbiter by invoking OPTIONS 
on the search arbiter. If the resource supports SEARCH, then the DASL response header will 
appear in the response. The DASL response header lists the supported grammars. 

3.1 The OPTIONS Method 

The OPTIONS method allows the client to discover if a resource supports the SEARCH method 
and to determine the list of search grammars supported for that resource. 

The client issues the OPTIONS method against a resource named by the Request-URI. This is a
normal invocation of OPTIONS as defined in [RFC2068].

If a resource supports the SEARCH method, then the server MUST list SEARCH in the 
OPTIONS response as defined by [RFC2068]. 

DASL servers MUST include the DASL header in the OPTIONS response. This header identifies 
the search grammars supported by that resource. 

3.2 The DASL Response Header 

DASLHeader = "DASL" ":" Coded-URL-List
Coded-URL-List = Coded-URL [ "," Coded-URL-List ]
Coded-URL ; defined in section 9.4 of [WEBDAV]

The DASL response header indicates server support for a query grammar in the OPTIONS 
method. The value is a URI that indicates the type of grammar. This header MAY be repeated. 



For example:

DASL: <http://foo.bar.com/syntax1>
DASL: <http://akuma.com/syntax2>
DASL: <FOO:natural-language-query>

3.3 Example: Grammar Discovery

This example shows that the server supports search on the /somefolder resource with the query
grammars DAV:basicsearch, http://foo.bar.com/syntax1, and http://akuma.com/syntax2. Note
that every server MUST support DAV:basicsearch.

>> Request

OPTIONS /somefolder HTTP/1.1
Connection: Close
Host: ryu.com

>> Response

HTTP/1.1 200 OK
Date: Tue, 20 Jan 1998 20:52:29 GMT
Connection: close
Accept-Ranges: none
Allow: OPTIONS, GET, HEAD, POST, PUT, DELETE, TRACE, COPY, MOVE, MKCOL, PROPFIND, ...
Public: OPTIONS, GET, HEAD, POST, PUT, DELETE, TRACE, COPY, MOVE, MKCOL, PROPFIND, ...
DASL: <DAV:basicsearch>
DASL: <http://foo.bar.com/syntax1>
DASL: <http://akuma.com/syntax2>
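A client-side reading of the DASL headers shown above might look like the following minimal sketch. It assumes the caller has already collected the raw values of every DASL header from the OPTIONS response into a list; the function names parse_dasl_headers and supports_basicsearch are hypothetical and are not defined by DASL.

```python
import re

# Matches one Coded-URL, i.e. a URI wrapped in angle brackets.
CODED_URL = re.compile(r"<([^<>]*)>")


def parse_dasl_headers(header_values):
    """Return the grammar URIs found in a list of DASL header values, in order."""
    grammars = []
    for value in header_values:
        # A single header value may carry a comma-separated Coded-URL-List.
        for uri in CODED_URL.findall(value):
            uri = uri.strip()
            if uri and uri not in grammars:
                grammars.append(uri)
    return grammars


def supports_basicsearch(header_values):
    """True if the mandatory DAV:basicsearch grammar is advertised."""
    return "DAV:basicsearch" in parse_dasl_headers(header_values)
```

Because the header MAY be repeated and one value may also hold a list, the sketch accepts both forms and removes duplicates.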

4. Query Schema Discovery: QSD 

Servers MAY support the discovery of the schema for a query grammar. 

The DASL response header provides a means for clients to discover the set of query grammars
supported by a resource. This alone is not sufficient information for a client to generate a query.
For example, the DAV:basicsearch grammar defines a set of queries consisting of a set of
operators applied to a set of properties and values, but the grammar itself does not specify which
properties may be used in the query. QSD for the DAV:basicsearch grammar allows a client to
discover the set of properties that are searchable, selectable, and sortable. Moreover, although the
DAV:basicsearch grammar defines a minimal set of operators, it is possible that a resource
might support additional operators in a query. For example, a resource might support an optional
operator that can be used to express content-based queries in a proprietary syntax. QSD allows a
client to discover these operators and their syntax. The set of discoverable quantities will differ
from grammar to grammar, but each grammar can define a means for a client to discover what
can be discovered.

In general, the schema for a given query grammar depends on both the resource (the arbiter) and
the scope. A given resource might have access to one set of properties for one potential scope, and
another set for a different scope. For example, consider a server able to search two distinct
collections, one holding cooking recipes, the other design documents for nuclear weapons. While
both collections might support properties such as author, title, and date, the first might also define
properties such as calories and preparation time, while the second defines properties such as yield
and applicable patents. Two distinct arbiters indexing the same collection might also have access
to different properties. For example, the recipe collection mentioned above might also be indexed by
a value-added server that also stores the names of chefs who have tested the recipe. Note also that
the available query schema might depend on other factors, such as the identity of the
principal conducting the search, but these factors are not exposed in this protocol.

Each query grammar supported by DASL defines its own syntax for expressing the possible query
schema. A client retrieves the schema for a given query grammar on an arbiter resource with a
given scope by invoking the SEARCH method on that arbiter, with that grammar and scope, with
a query whose DAV:select element includes the DAV:queryschema property. This property is
defined only in the context of such a search; a server SHOULD NOT treat it as defined in the
context of a PROPFIND on the scope. The content of this property is an XML element whose
name and syntax depend upon the grammar, and whose value may (and likely will) vary
depending upon the grammar, arbiter, and scope.

The query schema for DAV:basicsearch is defined in section 5.19.

4.1 The DAV:queryschema Property

<!ELEMENT queryschema ANY >

4.1.1 Example of query schema discovery 

In this example, the arbiter is recipes.com, the grammar is DAV:basicsearch, and the scope is also
recipes.com.

SEARCH / HTTP/1.1
Host: recipes.com
Content-Type: application/xml
Connection: Close
Content-Length: xxx

<?xml version="1.0" ?>
<D:searchrequest xmlns:D="DAV:">
  <D:basicsearch>
    <D:select>
      <D:queryschema/>
    </D:select>
    <D:from><D:scope><D:href>http://recipes.com</D:href></D:scope></D:from>
  </D:basicsearch>
</D:searchrequest>

Response:

HTTP/1.1 207 Multistatus
Content-Type: application/xml
Content-Length: xxx

<?xml version="1.0" ?>
<D:multistatus xmlns:D="DAV:">
  <D:response>
    <D:href>http://recipes.com</D:href>
    <D:propstat>
      <D:prop>
        <D:querygrammar>
          <D:basicsearchschema>
            See section 5.19.9 for actual contents
          </D:basicsearchschema>
        </D:querygrammar>
      </D:prop>
      <D:status>HTTP/1.1 200 Okay</D:status>
    </D:propstat>
  </D:response>
</D:multistatus>

5 The DAV:basicsearch Grammar

5.1 Introduction

DAV:basicsearch uses an extensible XML syntax that allows clients to express search requests
that are generally useful for WebDAV scenarios. DASL-extended servers MUST accept this
grammar, and MAY accept other grammars.

DAV:basicsearch has several components:

1. DAV:select provides the result record definition.

2. DAV:from defines the scope.

3. DAV:where defines the criteria.

4. DAV:orderby defines the sort order of the result set.

5. DAV:limit provides constraints on the query as a whole.



5.2 The DAV:basicsearch DTD

<!ELEMENT basicsearch (select, from, where?, orderby?, limit?) >
<!ELEMENT select (allprop | prop) >
<!ELEMENT from (scope) >
<!ELEMENT scope (href, depth?) >

<!ENTITY % comp_ops "eq | lt | gt | lte | gte">
<!ENTITY % log_ops "and | or | not">
<!ENTITY % special_ops "isdefined">
<!ENTITY % string_ops "like">
<!ENTITY % content_ops "contains">

<!ENTITY % all_ops "%comp_ops; | %log_ops; | %special_ops; | %string_ops; | %content_ops;">

<!ELEMENT where ( %all_ops; ) >

<!ELEMENT and ( ( %all_ops; ) +) >
<!ELEMENT or ( ( %all_ops; ) +) >
<!ELEMENT not ( %all_ops; ) >

<!ELEMENT lt ( prop , literal ) >
<!ATTLIST lt casesensitive (1|0) "1" >
<!ELEMENT lte ( prop , literal ) >
<!ATTLIST lte casesensitive (1|0) "1" >
<!ELEMENT gt ( prop , literal ) >
<!ATTLIST gt casesensitive (1|0) "1" >
<!ELEMENT gte ( prop , literal ) >
<!ATTLIST gte casesensitive (1|0) "1" >
<!ELEMENT eq ( prop , literal ) >
<!ATTLIST eq casesensitive (1|0) "1" >

<!ELEMENT literal (#PCDATA) >
<!ATTLIST literal xml:space (default | preserve) preserve >

<!ELEMENT isdefined (prop) >
<!ELEMENT like (prop, literal) >
<!ELEMENT contains (#PCDATA) >

<!ELEMENT orderby (order+) >
<!ELEMENT order (prop, (ascending | descending)?) >
<!ATTLIST order casesensitive (1|0) "1" >
<!ELEMENT ascending EMPTY>
<!ELEMENT descending EMPTY>

<!ELEMENT limit (nresults) >
<!ELEMENT nresults (#PCDATA) >

5.2.1 Example Query 

This query retrieves the content length values for all resources located under the server's
"/container1/" URI namespace whose length exceeds 10000.

<d:searchrequest xmlns:d="DAV:">
  <d:basicsearch>
    <d:select>
      <d:prop><d:getcontentlength/></d:prop>
    </d:select>
    <d:from>
      <d:scope>
        <d:href>/container1/</d:href>
        <d:depth>infinity</d:depth>
      </d:scope>
    </d:from>
    <d:where>
      <d:gt>
        <d:prop><d:getcontentlength/></d:prop>
        <d:literal>10000</d:literal>
      </d:gt>
    </d:where>
    <d:orderby>
      <d:order>
        <d:prop><d:getcontentlength/></d:prop>
        <d:ascending/>
      </d:order>
    </d:orderby>
  </d:basicsearch>
</d:searchrequest>
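A request body like the example query can also be assembled programmatically rather than written by hand. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper names q and build_size_query are hypothetical, and the sketch only builds the XML body, it does not send the SEARCH request.

```python
import xml.etree.ElementTree as ET

DAV = "DAV:"
# Ask ElementTree to serialize the DAV: namespace with the "d" prefix.
ET.register_namespace("d", DAV)


def q(name):
    """Qualify a local element name with the DAV: namespace."""
    return "{%s}%s" % (DAV, name)


def build_size_query(scope_href, min_length, depth="infinity"):
    """Build a basicsearch body selecting getcontentlength above min_length."""
    root = ET.Element(q("searchrequest"))
    basic = ET.SubElement(root, q("basicsearch"))

    # select: the result record holds only getcontentlength
    prop = ET.SubElement(ET.SubElement(basic, q("select")), q("prop"))
    ET.SubElement(prop, q("getcontentlength"))

    # from: exactly one scope with an href and an optional depth
    scope = ET.SubElement(ET.SubElement(basic, q("from")), q("scope"))
    ET.SubElement(scope, q("href")).text = scope_href
    ET.SubElement(scope, q("depth")).text = depth

    # where: getcontentlength > min_length
    gt = ET.SubElement(ET.SubElement(basic, q("where")), q("gt"))
    ET.SubElement(ET.SubElement(gt, q("prop")), q("getcontentlength"))
    ET.SubElement(gt, q("literal")).text = str(min_length)

    # orderby: ascending by getcontentlength
    order = ET.SubElement(ET.SubElement(basic, q("orderby")), q("order"))
    ET.SubElement(ET.SubElement(order, q("prop")), q("getcontentlength"))
    ET.SubElement(order, q("ascending"))

    return ET.tostring(root, encoding="unicode")
```

Calling build_size_query("/container1/", 10000) yields a body with the same element structure as the example; the child elements are emitted in the order the DTD requires (select, from, where, orderby).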

5.3 DAV:select

DAV:select defines the result record, which is a set of properties and values. This document
defines two possible values: DAV:allprop and DAV:prop, both defined in [WebDAV].

If the value is DAV:allprop, the result record for a given resource includes all the properties for
that resource.

If the value is DAV:prop, then the result record for a given resource includes only those properties
named by the DAV:prop element. Each property named by the DAV:prop element must be
referenced in the Multistatus response.

The rules governing the status codes for each property match those of the PROPFIND method
defined in [WebDAV].

5.4 DAV:from

DAV:from defines the query scope. This contains exactly one DAV:scope element. The scope
element contains a mandatory DAV:href element and an optional DAV:depth element.

DAV:href indicates the URI for a collection to use as a scope.

When the scope is a collection, if DAV:depth is "0", the search includes only the collection. When
it is "1", the search includes the (top-level) members of the collection. When it is "infinity", the
search includes all recursive members of the collection.

5.4.1 Relationship to the Request-URI

If the DAV:scope element is an absolute URI, the scope is exactly that URI.

If the DAV:scope element is a relative URI, the scope is taken to be relative to the Request-URI.

5.4.2 Scope 

A Scope can be an arbitrary URI. 

Servers, of course, may support only particular scopes. This may include limitations for particular 
schemes such as "http:" or "ftp:" or certain URI namespaces. 

If a scope is given that is not supported, the server MUST respond with a 400 status code that
includes a Multistatus error. A scope in the query appears as a resource in the response and must
include an appropriate status code indicating its validity with respect to the search arbiter.

Example: 

HTTP/1.1 400 Bad Request
Content-Type: text/xml
Content-Length: 428

<?xml version="1.0" ?>
<d:multistatus xmlns:d="DAV:" xmlns:F="FOO:">
  <d:response>
    <d:href>http://www.foo.com/scope1</d:href>
    <d:status>HTTP/1.1 502 Bad Gateway</d:status>
  </d:response>
</d:multistatus>

This example shows the response if there is a scope error. The response provides a Multistatus 
with a status for the scope. In this case, the scope cannot be reached because the server cannot 
search another server (502). 
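A client that receives such an error body may want to extract the per-scope status codes. The sketch below is one possible way to do this with Python's standard xml.etree.ElementTree module; the function name scope_statuses is hypothetical, and the sketch assumes the body declares the DAV: namespace correctly.

```python
import xml.etree.ElementTree as ET

NS = {"d": "DAV:"}


def scope_statuses(multistatus_xml):
    """Map each href in a Multistatus body to its numeric HTTP status code."""
    root = ET.fromstring(multistatus_xml)
    result = {}
    for response in root.findall("d:response", NS):
        href = response.findtext("d:href", default="", namespaces=NS).strip()
        # A status line looks like "HTTP/1.1 502 Bad Gateway".
        parts = response.findtext("d:status", default="", namespaces=NS).split()
        code = int(parts[1]) if len(parts) >= 2 and parts[1].isdigit() else None
        result[href] = code
    return result
```

A scope whose status element is missing or malformed maps to None, so the caller can distinguish "no usable status" from a real code such as 502.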

5.5 DAV:where

The DAV:where element defines the search condition for inclusion of resources in the result set. The
value of this element is an XML element that defines a search operator that evaluates to one of the
Boolean truth values TRUE, FALSE, or UNKNOWN. The search operator contained by
DAV:where may itself contain and evaluate additional search operators as operands, which in turn
may contain and evaluate additional search operators as operands, and so on recursively.

5.5.1 Use of Three-Valued Logic in Queries

Each operator defined for use in the where clause that returns a Boolean value MUST evaluate to 
TRUE, FALSE, or UNKNOWN. The resource under scan is included as a member of the result 
set if and only if the search condition evaluates to TRUE. 

Consult Appendix A for details on the application of three-valued logic in query expressions. 
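As an illustration only, the following sketch models the three truth values in Python, using None for UNKNOWN. It assumes the usual ANSI SQL truth tables, which Appendix A refers to; the function names are hypothetical.

```python
# None stands for UNKNOWN; True and False are the other two truth values.

def and3(*operands):
    """Three-valued AND: FALSE dominates, then UNKNOWN, otherwise TRUE."""
    if any(v is False for v in operands):
        return False
    if any(v is None for v in operands):
        return None
    return True


def or3(*operands):
    """Three-valued OR: TRUE dominates, then UNKNOWN, otherwise FALSE."""
    if any(v is True for v in operands):
        return True
    if any(v is None for v in operands):
        return None
    return False


def not3(operand):
    """Three-valued NOT: UNKNOWN stays UNKNOWN."""
    return None if operand is None else (not operand)


def include_in_result(condition):
    """A resource is included only when the condition is exactly TRUE."""
    return condition is True
```

Note that a resource whose search condition evaluates to UNKNOWN is excluded, just as one that evaluates to FALSE.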

5.5.2 Handling Optional operators 

If a query provides an operator that is not supported by the server, then the server MUST respond 
with a 422 (Unprocessable Entity) status code. 

5.5.3 Treatment of NULL Values 

If a PROPFIND for a property value would yield a 404 or 403 response for that
property, then that property is considered NULL.

NULL values are "less than" all other values in comparisons.

Empty strings (zero-length strings) are not NULL values. An empty string is "less than" a string
with length greater than zero.

The DAV:isdefined operator is defined to test if the value of a property is NULL.

5.5.4 Example: Testing for Equality

The example shows a single operator (DAV:eq) applied in the criteria.

<d:where> 
<d:eq> 

<d:prop> <d:getcontentlength/> </d:prop> 
<d:literal> 100 </d:literal> 
</d:eq> 

</d:where> 

5.5.5 Example: Relative Comparisons 

The example shows a more complex operation involving several operators (DAV:and, DAV:eq,
DAV:gt) applied in the criteria. This DAV:where expression matches those resources that are
"image/gifs" over 4K in size.

<D:where>
  <D:and>
    <D:eq>
      <D:prop> <D:getcontenttype/> </D:prop>
      <D:literal> image/gif </D:literal>
    </D:eq>
    <D:gt>
      <D:prop> <D:getcontentlength/> </D:prop>
      <D:literal> 4096 </D:literal>
    </D:gt>
  </D:and>
</D:where>

5.6 DAV: orderby 

The DAV:orderby element specifies the ordering of the result set. It contains one or more
DAV:order elements, each of which specifies a comparison between two items in the result set.
Informally, a comparison specifies a test that determines whether one resource appears before
another in the result set. Comparisons are applied in the order they occur in the DAV:orderby
element, earlier comparisons being more significant.

The comparisons defined here use only a single property from each resource, compared using the
same ordering as the DAV:lt operator (ascending) or DAV:gt operator (descending). If neither
direction is specified, the default is DAV:ascending.

In the context of the DAV:orderby element, null values are considered to collate before any actual
(i.e., non-null) value, including strings of zero length (as in ANSI standard SQL, [ANSISQL]).

5.6.1 Comparing Natural Language Strings

Comparisons on strings take into account the language defined for that property. Clients MAY
specify the language using the xml:lang attribute. If no language is specified either by the client or
defined for that property by the server, or if a comparison is performed on strings of two different
languages, the results are undefined.

The DAV:casesensitive attribute may be used to indicate case-sensitivity for comparisons.

5.6.2 Example of Sorting 

This sort orders first by last name of the author, and then by size, in descending order, so that the 
largest works appear first. 

<d:orderby>
  <d:order>
    <d:prop><r:lastname/></d:prop>
    <d:ascending/>
  </d:order>
  <d:order>
    <d:prop><d:getcontentlength/></d:prop>
    <d:descending/>
  </d:order>
</d:orderby>
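The ordering rules of section 5.6 can be sketched as a multi-key sort. The following is a minimal illustration, not a server implementation: it assumes result records are Python dicts, that None represents a NULL property value, and that all non-NULL values of one property are mutually comparable. The function name sort_results is hypothetical. Language-sensitive collation and the casesensitive attribute are not modelled.

```python
def sort_results(records, orders):
    """Sort records by (property, direction) pairs; earlier pairs are more significant."""
    result = list(records)
    # Python's sort is stable, so apply the least significant comparison first.
    for prop, direction in reversed(orders):
        def key(record, prop=prop):
            value = record.get(prop)
            # (False, ...) sorts before (True, value), so NULL collates
            # before every actual value in ascending order.
            return (value is not None, value if value is not None else 0)
        result.sort(key=key, reverse=(direction == "descending"))
    return result
```

With descending order the whole collation is reversed, so NULL values end up last; that reading of the text is an assumption of this sketch.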

5.7 Boolean Operators: DAV:and, DAV:or, and DAV:not

The DAV:and operator performs a logical AND operation on the expressions it contains.
The DAV:or operator performs a logical OR operation on the values it contains.
The DAV:not operator performs a logical NOT operation on the values it contains.

5.8 DAV:eq

The DAV:eq operator provides simple equality matching on property values.
The DAV:casesensitive attribute may be used with this element.

5.9 DAV:lt, DAV:lte, DAV:gt, DAV:gte

The DAV:lt, DAV:lte, DAV:gt, and DAV:gte operators provide comparisons on property values,
using less-than, less-than or equal, greater-than, and greater-than or equal respectively. The
DAV:casesensitive attribute may be used with these elements.

5.10 DAV:literal

DAV:literal allows literal values to be placed in an expression.

Because white space in literal values is significant in comparisons, DAV:literal makes use of
the xml:space attribute to identify this significance. The default value of this attribute for
DAV:literal is preserve. Consult section 2.10 of [XML] for more information on the use of this
attribute.

5.11 DAV:isdefined 

The DAV:isdefined operator allows clients to determine whether a property is defined on a
resource. The meaning of "defined on a resource" is found in section 5.5.3.

Example:

<d:isdefined>
  <d:prop><x:someprop/></d:prop>
</d:isdefined>

The DAV:isdefined operator is optional.

5.12 DAV:like

DAV:like is an optional operator intended to give simple wildcard-based pattern matching
ability to clients.

The operator takes two arguments.

The first argument is a DAV:prop element identifying a single property to evaluate.
The second argument is a DAV:literal element that gives the pattern matching string.

5.12.1 Syntax for the Literal Pattern

Pattern := [wildcard] 0*( text [wildcard] )
wildcard := exactlyone | zeroormore
text := 1*( <octet> | escapesequence )
exactlyone := "?"
zeroormore := "%"
escapechar := "\"
escapesequence := "\" ( exactlyone | zeroormore | escapechar )

The value for the literal is composed of wildcards separated by segments of text. Wildcards may
begin or end the literal. Wildcards may not be adjacent.
The "?" wildcard matches exactly one character.
The "%" wildcard matches zero or more characters.

The "\" character is an escape character so that the literal can include "?" and "%". To include the
"\" character in the pattern, the escape sequence "\\" is used.

5.12.2 Example of DAV: like 

This example shows how a client might use DAV:like to identify those resources whose content
type was a subtype of image.

<D:where>
  <D:like>
    <D:prop><D:getcontenttype/></D:prop>
    <D:literal>image%</D:literal>
  </D:like>
</D:where>
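One way a server might evaluate a DAV:like pattern is to translate it into a regular expression. The sketch below, using Python's standard re module, follows the wildcard and escape rules of section 5.12.1; the function names like_to_regex and like are hypothetical. It does not reject adjacent wildcards, and it ignores case-sensitivity options.

```python
import re


def like_to_regex(pattern):
    """Translate a DAV:like literal into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            # Escape sequence: the next character is taken literally.
            if i + 1 >= len(pattern) or pattern[i + 1] not in "?%\\":
                raise ValueError("invalid escape sequence in pattern")
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "?":
            parts.append(".")    # exactly one character
        elif ch == "%":
            parts.append(".*")   # zero or more characters
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.DOTALL)


def like(value, pattern):
    """True if the whole value matches the DAV:like pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None
```

For the example above, like("image/gif", "image%") holds, while like("text/html", "image%") does not.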

5.13 DAV: contains 

The DAV:contains operator is an optional operator that provides content-based search capability.
This operator implicitly searches against the text content of a resource, not against content of
properties. The DAV:contains operator is intentionally not overly constrained, in order to allow
the server to do the best job it can in performing the search.

The DAV:contains operator evaluates to a Boolean value. It evaluates to TRUE if the content of
the resource satisfies the search. Otherwise, it evaluates to FALSE.

Within the DAV:contains XML element, the client provides a phrase: a single word or
whitespace-delimited sequence of words. Servers MAY ignore punctuation in a phrase.
Case-sensitivity is left to the server.

The following things may or may not be done as part of the search: Phonetic methods such as
"soundex" may or may not be used. Word stemming may or may not be performed. Thesaurus
expansion of words may or may not be done. Right or left truncation may or may not be
performed. The search may be case insensitive or case sensitive. The word or words may or may
not be interpreted as names. Multiple words may or may not be required to be adjacent or "near"
each other. Multiple words may or may not be required to occur in the same order. Multiple words
may or may not be treated as a phrase. The search may or may not be interpreted as a request to
find documents "similar" to the string operand.

The DAV:score property is intended to be useful to rank documents satisfying the DAV:contains
operator.

5.13.1 Examples

The example below shows a search for the phrase "Peter Forsberg". 

Depending on its support for content-based searching, a server MAY treat this as a search for 
documents that contain the words "Peter" and "Forsberg". 




<D:where>
  <D:contains>Peter Forsberg</D:contains>
</D:where>

The example below shows a search for resources that contain "Peter" and "Forsberg".

<D:where>
  <D:and>
    <D:contains>Peter</D:contains>
    <D:contains>Forsberg</D:contains>
  </D:and>
</D:where>

5.14 The DAV:limit XML Element

<!ELEMENT limit (nresults) >

The DAV:limit XML element contains requested limits from the client to limit the size of the
reply or amount of effort expended by the server.

5.15 The DAV:nresults XML Element

<!ELEMENT nresults (#PCDATA)> ; only digits

The DAV:nresults XML element contains a requested maximum number of records to be
returned in a reply. The server MAY disregard this limit. The value of this element is an integer.

5.16 The DAV:casesensitive XML attribute

The DAV:casesensitive attribute allows clients to specify case-sensitive or case-insensitive
behavior for DAV:basicsearch operators.

The possible values for DAV:casesensitive are "1" or "0". The "1" value indicates
case-sensitivity. The "0" value indicates case-insensitivity. The default value is server-specified.

Support for the DAV:casesensitive attribute is optional. A server should respond with an error 422 if the
DAV:casesensitive attribute is used but cannot be supported.

5.17 The DAV:score Property

<!ELEMENT score (#PCDATA) >

The DAV:score XML element is a synthetic property whose value is defined only in the context
of a query result where the server computes a score, e.g. based on relevance. It may be used in
DAV:select or DAV:orderby elements. Servers SHOULD support this property. The value is a
string representing the score, an integer from zero to 10000 inclusive, where a higher value
indicates a higher score (e.g. more relevant).

Clients should note that, in general, it is not meaningful to compare the numeric values of scores 
from two different queries unless both were executed by the same underlying search system on the 
same collection of resources. 






5.18 The DAV:iscollection Property

<!ELEMENT iscollection (#PCDATA)>

The DAV:iscollection XML element is a synthetic property whose value is defined only in the
context of a query.

The property is TRUE (the literal string "1") of a resource if and only if a PROPFIND of the
DAV:resourcetype property for that resource would contain the DAV:collection XML
element. The property is FALSE (the literal string "0") otherwise.

Rationale: This property is provided in lieu of defining generic structure queries, which would 
suffice for this and for many more powerful queries, but seems inappropriate to standardize at this 
time. 



5.18.1 Example of DAV:iscollection

This example shows a search criterion that picks out all and only the resources in the scope that
are collections.

<D:where>
  <D:eq>
    <D:prop><D:iscollection/></D:prop>
    <D:literal>1</D:literal>
  </D:eq>
</D:where>



5.19 QuerySchema for DAV:basicsearch

The DAV:basicsearch grammar defines a search criterion that is a Boolean-valued expression,
and allows for an arbitrary set of properties to be included in the result record. The result set may
be sorted on a set of property values. Accordingly, the DTD for schema discovery for this grammar
allows the server to express:

1. the set of optional operators defined by the resource.

5.19.1 DTD for DAV:basicsearch QSD

<!ELEMENT basicsearchschema (properties, operators) >
<!ELEMENT properties (propdesc*) >
<!ELEMENT propdesc (prop, ANY) >
<!ELEMENT operators (opdesc*) >
<!ELEMENT opdesc ANY >
<!ELEMENT operand_property EMPTY >
<!ELEMENT operand_literal EMPTY >

The DAV:properties element holds a list of descriptions of properties.

The DAV:operators element describes the optional operators that may be used in a DAV:where
element.

5.19.2 DAV:propdesc Element

Each instance of a DAV:propdesc element describes the property or properties in the DAV:prop
element it contains. All subsequent elements are descriptions that apply to those properties. All
descriptions are optional and may appear in any order. Servers SHOULD support all the
descriptions defined here, and MAY define others.

DASL defines five descriptions. The first, DAV:datatype, provides a hint about the type of the
property value, and may be useful to a user interface prompting for a value. The remaining four
(DAV:searchable, DAV:selectable, DAV:sortable, and DAV:casesensitive) identify
portions of the query (DAV:where, DAV:select, and DAV:orderby, respectively). If a property
has a description for a section, then the server MUST allow the property to be used in that section.
These descriptions are optional. If a property does not have such a description, or is not described
at all, then the server MAY still allow the property to be used in the corresponding section.

5.19.3 The DAV:datatype Property Description

The DAV:datatype element contains a single XML element that provides a hint about the domain
of the property, which may be useful to a user interface prompting for a value to be used in a
query. The namespace for expressing a DASL defined data type is
"urn:uuid:C2F41010-65B3-11d1-A29F-00AA00C14882/".

<!ELEMENT datatype ANY >

DASL defines the following data type elements:

Name                  Example
boolean               1, 0
string                Foobar
dateTime.iso8601tz    1994-11-05T08:15:5Z
float                 .314159265358979E+1
int                   -259, 23

If the data type of a property is not given, then the data type defaults to string.

5.19.4 The DAV:searchable Property Description

<!ELEMENT searchable EMPTY >

If this element is present, then the server MUST allow this property to appear within a DAV:where
element where an operator allows a property. Allowing a search does not mean that the property is
guaranteed to be defined on every resource in the scope; it only indicates the server's willingness
to check.

5.19.5 The DAV:selectable Property Description

<!ELEMENT selectable EMPTY >

This element indicates that the property may appear in the DAV:select element.

5.19.6 The DAV:sortable Property Description

This element indicates that the property may appear in the DAV:orderby element.

<!ELEMENT sortable EMPTY >

5.19.7 The DAV:casesensitive Property Description

This element only applies to properties whose data type is "string" as per the DAV:datatype
property description. Its presence indicates that comparisons performed for searches, and the
comparisons for ordering results on the string property, will be case sensitive. (The default is case
insensitive.)

<!ELEMENT casesensitive EMPTY >

5.19.8 The DAV:operators XML Element

The DAV:operators element describes every optional operator supported in a query. (Mandatory
operators are not listed since they are mandatory and permit no variation in syntax.) All optional
operators that are supported MUST be listed in the DAV:operators element. The listing for an
operator consists of the operator (as an empty element), followed by one element for each
operand. The operand MUST be either DAV:operand_property or DAV:operand_literal, which
indicate that the operand in the corresponding position is a property or a literal value,
respectively. If an operator is polymorphic (allows more than one operand syntax) then each
permitted syntax MUST be listed separately.

<D:opdesc><D:like/><D:operand_property/><D:operand_literal/></D:opdesc>

5.19.9 Example of Query Schema for DAV:basicsearch

<D:basicsearchschema xmlns:D="DAV:" xmlns:t="urn:uuid:C2F41010-65B3-11d1-A29F-00AA00C14882/">
  <D:properties>
    <D:propdesc>
      <D:prop><D:getcontentlength/></D:prop>
      <D:datatype><t:int/></D:datatype>
      <D:searchable/><D:selectable/><D:sortable/>
    </D:propdesc>
    <D:propdesc>
      <D:prop><D:getcontenttype/><D:displayname/></D:prop>
      <D:searchable/><D:selectable/><D:sortable/>
    </D:propdesc>
    <D:propdesc>
      <D:prop><J:fstop/></D:prop>
      <D:selectable/>
    </D:propdesc>
  </D:properties>
  <D:operators>
    <D:opdesc>
      <D:isdefined/><D:operand_property/>
    </D:opdesc>
    <D:opdesc>
      <D:like/><D:operand_property/><D:operand_literal/>
    </D:opdesc>
  </D:operators>
</D:basicsearchschema>

This response lists four properties. The datatype of the last three properties is not given, so it
defaults to string. All are selectable, and the first three may be searched. All but the last may be
used in a sort. Of the optional DAV operators, DAV:isdefined and DAV:like are supported.

Note: The schema discovery defined here does not provide for discovery of supported values of
the DAV:casesensitive attribute. This may require that the reply also list the mandatory
operators.
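A client could read a DAV:basicsearchschema element and work out where each property may be used. The following is a minimal sketch with Python's standard xml.etree.ElementTree module; the function name describe_properties is hypothetical, and the sketch assumes every namespace prefix in the schema is declared. It looks only at the searchable, selectable, and sortable descriptions.

```python
import xml.etree.ElementTree as ET

DAV = "{DAV:}"
FLAGS = (DAV + "searchable", DAV + "selectable", DAV + "sortable")


def describe_properties(schema_xml):
    """Map each described property (qualified name) to its set of description flags."""
    root = ET.fromstring(schema_xml)
    described = {}
    for propdesc in root.iter(DAV + "propdesc"):
        prop = propdesc.find(DAV + "prop")
        if prop is None:
            continue
        # Collect the local names of the descriptions present on this propdesc.
        flags = {child.tag[len(DAV):] for child in propdesc if child.tag in FLAGS}
        # One propdesc may describe several properties at once.
        for named in prop:
            described[named.tag] = set(flags)
    return described
```

Keep in mind that a missing description does not forbid a use: the server MAY still allow the property in that part of the query.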

6 Internationalization Considerations 

Clients have the opportunity to tag properties with a language when they are stored. The server
SHOULD read this language-tagging by examining the xml:lang attribute on any properties stored
on a resource.

The xml:lang attribute specifies a nationalized collation sequence when properties are compared.
Comparisons when this attribute differs have undefined order.

7 Security Considerations 

This section is provided to detail issues concerning security implications of which DASL 
applications need to be aware. All of the security considerations of HTTP/1.1 also apply to DASL. 
In addition, this section will include security risks inherent in searching and retrieval of resource 
properties and content. 

A query must not allow one to retrieve information about values or existence of properties that
one could not obtain via PROPFIND (e.g., by use in DAV:orderby, or in expressions on
properties).

A server should prepare for denial of service attacks. For example, a client may issue a query for
which the result set is expensive to calculate or transmit because many resources match or must
be evaluated.

7.1 Implications of XML External Entities

XML supports a facility known as "external entities", defined in section 4.2.2 of [REC-XML], 
which instruct an XML processor to retrieve and perform an inline include of XML located at a 
particular URI. An external XML entity can be used to append or modify the document type 
declaration (DTD) associated with an XML document. An external XML entity can also be used 
to include XML within the content of an XML document. For non-validating XML, such as the 
XML used in this specification, including an external XML entity is not required by [REC-XML]. 
However, [REC-XML] does state that an XML processor may, at its discretion, include the 
external XML entity. 

External XML entities have no inherent trustworthiness and are subject to all the attacks that are 
endemic to any HTTP GET request. Furthermore, it is possible for an external XML entity to 
modify the DTD, and hence affect the final form of an XML document, in the worst case 
significantly modifying its semantics, or exposing the XML processor to the security risks 
discussed in [RFC2376]. Therefore, implementers must be aware that external XML entities 
should be treated as untrustworthy. 
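
One conservative way to act on this advice is to refuse any request body that carries a document type declaration at all, so that no external entity can ever be declared. The following is a minimal sketch of that policy using Python's standard expat binding. The names DtdNotAllowed and parse_without_dtd are hypothetical, and rejecting every DTD is an assumption that is stricter than what this section requires.

```python
import xml.parsers.expat


class DtdNotAllowed(ValueError):
    """Raised when a request body carries a document type declaration."""


def parse_without_dtd(body: bytes) -> list[str]:
    """Parse `body` and return its element names, refusing any DOCTYPE."""
    names: list[str] = []
    parser = xml.parsers.expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Called as soon as a DOCTYPE is seen, before any entity is resolved.
        raise DtdNotAllowed(f"DOCTYPE {name!r} rejected")

    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(body, True)
    return names
```

A body without a DTD parses normally, while one that declares a DTD raises DtdNotAllowed before anything is fetched.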

There is also the scalability risk that would accompany a widely deployed application which made 
use of external XML entities. In this situation, it is possible that there would be significant 
numbers of requests for one external XML entity, potentially overloading any server which fields 
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requests for the resource containing the external XML entity. 

8 Scalability 

Query grammars are identified by URIs. Applications SHOULD NOT attempt to retrieve these
URIs even if they appear to be retrievable (for example, those that begin with "http://").

9 Authentication 

Authentication mechanisms defined in WebDAV will also apply to DASL. 

10 IANA Considerations 

This document uses the namespace defined by [WebDAV] for XML elements. All other IANA
considerations mentioned in [WebDAV] are also applicable to DASL.

11 Copyright 

To be supplied. 

12 Intellectual Property 

To be supplied. 
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15 APPENDICES 

Three-Valued Logic in DAV:basicsearch

ANSI standard three valued logic is used when evaluating the search condition (as defined in the
ANSI standard SQL specifications, for example in ANSI X3.135-1992, section 8.12, pp. 188-189,
section 8.2, p. 169, General Rule 1)a), etc.).

ANSI standard three valued logic is undoubtedly the most widely practiced method of dealing 
with the issues of properties in the search condition not having a value (e.g., being null or not 
defined) for the resource under scan, and with undefined expressions in the search condition (e.g., 
division by zero, etc.). Three valued logic works as follows. 
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Undefined expressions are expressions for which the value of the expression is not defined. 
Undefined expressions are a completely separate concept from the truth value UNKNOWN, 
which is, in fact, well defined. Property names and literal constants are considered expressions for 
purposes of this section. If a property in the current resource under scan has not been set to a 
value (either because the property is not defined for the current resource, or because it is null for 
the current resource), then the value of that property is undefined for the resource under scan. 
DASL 1.0 has no arithmetic division operator, but if it did, division by zero would be an
undefined arithmetic expression.

If any subpart of an arithmetic, string, or datetime subexpression is undefined, the whole 
arithmetic, string, or datetime subexpression is undefined. 

There are no manifest constants to explicitly represent undefined number, string, or datetime 
values. 

Since a Boolean value is ultimately returned by the search condition, arithmetic, string, and 
datetime expressions are always arguments to other operators. Examples of operators that convert 
arithmetic, string, and datetime expressions to Boolean values are the six relational operators 
("greater than", "less than", "equals", etc.). If either or both operands of a relational operator have 
undefined values, then the relational operator evaluates to UNKNOWN. Otherwise, the relational 
operator evaluates to TRUE or FALSE, depending upon the outcome of the comparison. 

The Boolean operators DAV:and, DAV:or and DAV:not are evaluated according to the following
rules:

not UNKNOWN = UNKNOWN
UNKNOWN and TRUE = UNKNOWN
UNKNOWN and FALSE = FALSE
UNKNOWN and UNKNOWN = UNKNOWN
UNKNOWN or TRUE = TRUE
UNKNOWN or FALSE = UNKNOWN
UNKNOWN or UNKNOWN = UNKNOWN
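
The rules above can be sketched in a few lines of Python. This is an illustration only: the names Truth, and3, or3, not3 and compare_lt are hypothetical, and modelling an undefined operand as None is an assumption made for the example. Ordering the truth values FALSE < UNKNOWN < TRUE lets "and" be a minimum and "or" a maximum, which reproduces every rule listed.

```python
from enum import Enum


class Truth(Enum):
    FALSE = 0
    UNKNOWN = 1
    TRUE = 2


def and3(a: Truth, b: Truth) -> Truth:
    # "and" yields the weaker of the two truth values.
    return Truth(min(a.value, b.value))


def or3(a: Truth, b: Truth) -> Truth:
    # "or" yields the stronger of the two truth values.
    return Truth(max(a.value, b.value))


def not3(a: Truth) -> Truth:
    # TRUE and FALSE swap; UNKNOWN stays UNKNOWN.
    return Truth(2 - a.value)


def compare_lt(left, right) -> Truth:
    # A relational operator with an undefined operand (None) is UNKNOWN.
    if left is None or right is None:
        return Truth.UNKNOWN
    return Truth.TRUE if left < right else Truth.FALSE
```

For instance, a condition such as "size < 10 and author is defined" evaluates to FALSE, not UNKNOWN, when the second term is FALSE and the size property is missing.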
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2.6. Tags 



2.6.1. Features of Attribute Information

RGB data conforms to Baseline TIFF Rev. 6.0 RGB Full Color Images, and YCbCr data to TIFF Rev.
6.0 Extensions YCbCr Images. Accordingly, the parts that follow the TIFF structure must be recorded
in conformance to the TIFF standard. In addition to the attribute information indicated as mandatory
in the TIFF standard, this Exif standard adds the TIFF optional tags that can be used in a DSC or
other system, Exif-specific tags for recording DSC-specific attribute information, and GPS tags for
recording position information. There are also Exif-original specifications not found in the TIFF
standard for compressed recording of thumbnails.

Recording of compressed data differs from uncompressed data in the following respects: 

• When the primary image data is recorded in compressed form, there is no tag indicating the
primary image itself or its address (pointer).

• When thumbnail data is recorded in compressed form, address and size are designated using
Exif-specific tags.

• Tags that duplicate information given in the JPEG Baseline are not recorded (for either primary 
images or thumbnails). 

• Information relating to compression can be recorded using the tags for this purpose. 

2.6.2. IFD Structure 

The IFD used in this standard consists of a 2-byte count (number of fields), 12-byte field 
Interoperability arrays, and 4-byte offset to the next IFD, in conformance with TIFF Rev. 6.0. 

Each of the 12-byte field Interoperability consists of the following four elements respectively. 
Bytes 0-1 Tag 
Bytes 2-3 Type 
Bytes 4-7 Count 
Bytes 8-11 Value Offset

Each element is explained briefly below. For details see TIFF Rev. 6.0. 
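
The layout just described (a 2-byte count, 12-byte fields, and a 4-byte offset to the next IFD) can be read with Python's struct module. The sketch below is illustrative only: IfdEntry and parse_ifd are hypothetical names, and the byte order is passed in by the caller because it is declared in the TIFF header, which is not covered here.

```python
import struct
from typing import NamedTuple


class IfdEntry(NamedTuple):
    tag: int             # bytes 0-1
    type: int            # bytes 2-3
    count: int           # bytes 4-7
    value_offset: bytes  # bytes 8-11, kept raw: meaning depends on type and count


def parse_ifd(buf: bytes, start: int, big_endian: bool):
    """Return (entries, offset_of_next_ifd) for the IFD at `start`."""
    order = ">" if big_endian else "<"
    (n_fields,) = struct.unpack_from(order + "H", buf, start)
    entries = []
    for i in range(n_fields):
        pos = start + 2 + 12 * i
        tag, typ, count, raw = struct.unpack_from(order + "HHI4s", buf, pos)
        entries.append(IfdEntry(tag, typ, count, raw))
    (next_ifd,) = struct.unpack_from(order + "I", buf, start + 2 + 12 * n_fields)
    return entries, next_ifd
```

The Value Offset bytes are deliberately left undecoded, since whether they hold the value itself or an offset to it can only be decided from Type and Count.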



Tag 

Each tag is assigned a unique 2-byte number to identify the field. The tag numbers in the Exif 0th
IFD and 1st IFD are all the same as the TIFF tag numbers.



Type 

The following types are used in Exif: 

1 = BYTE       An 8-bit unsigned integer.

2 = ASCII      An 8-bit byte containing one 7-bit ASCII code. The final byte is terminated with NULL.

3 = SHORT      A 16-bit (2-byte) unsigned integer.

4 = LONG       A 32-bit (4-byte) unsigned integer.

5 = RATIONAL   Two LONGs. The first LONG is the numerator and the second LONG is the denominator.

7 = UNDEFINED  An 8-bit byte that can take any value depending on the field definition.

9 = SLONG      A 32-bit (4-byte) signed integer (2's complement notation).

10 = SRATIONAL Two SLONGs. The first SLONG is the numerator and the second SLONG is the denominator.



Count 

The number of values. It should be noted carefully that the count is not the sum of the bytes. In the
case of one value of SHORT (16 bits), for example, the count is '1' even though it is 2 bytes.
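
The distinction between count and byte length can be made concrete with a small lookup. This is a sketch: TYPE_SIZES and value_byte_length are hypothetical names, and the per-type sizes are derived from the type definitions given above.

```python
# Bytes occupied by one value of each Exif type code.
TYPE_SIZES = {
    1: 1,   # BYTE
    2: 1,   # ASCII
    3: 2,   # SHORT
    4: 4,   # LONG
    5: 8,   # RATIONAL (two LONGs)
    7: 1,   # UNDEFINED
    9: 4,   # SLONG
    10: 8,  # SRATIONAL (two SLONGs)
}


def value_byte_length(type_code: int, count: int) -> int:
    """Total bytes needed to store `count` values of the given type."""
    return TYPE_SIZES[type_code] * count
```

One SHORT therefore has a count of 1 but occupies 2 bytes, and a 20-character ASCII field has a count of 20.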

Value Offset 

This tag records the offset from the start of the TIFF header to the position where the value itself is 
recorded. In cases where the value fits in 4 bytes, the value itself is recorded. If the value is smaller 
than 4 bytes, the value is stored in the 4-byte area starting from the left, i.e., from the lower end of 
the byte offset area. For example, in big endian format, if the type is SHORT and the value is 1, it is 
recorded as 00010000.H. 

Note that field Interoperability must be recorded in sequence starting from the smallest tag number. 
There is no stipulation regarding the order or position of tag value (Value) recording. 
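
The Value Offset rule can be sketched as follows. The function name encode_value_field is hypothetical, and the caller is assumed to have already serialized the value in the file's byte order and to know where longer values will be written.

```python
import struct


def encode_value_field(payload: bytes, data_offset: int, big_endian: bool) -> bytes:
    """Return the 4 bytes to store in the Value Offset element.

    `payload` is the value already serialized in the file's byte order;
    `data_offset` is where it would be written (relative to the TIFF header)
    if it does not fit in 4 bytes.
    """
    if len(payload) <= 4:
        # The value fits: record it directly, left-justified and zero-padded.
        return payload.ljust(4, b"\x00")
    # Otherwise record the offset to the value.
    return struct.pack(">I" if big_endian else "<I", data_offset)
```

With a big endian SHORT of value 1, the result is the byte sequence 00 01 00 00, matching the 00010000.H example in the text.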
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2.6.3. Exit-specific IFD 

A. Exif IFD 

Exif IFD is a set of tags for recording Exif-specific attribute information. It is pointed to by the offset
from the TIFF header (Value Offset) indicated by an Exif private tag value.

Exif IFD Pointer 



Tag = 34665 (8769.H) 

Type = LONG 

Count = 1 

Default = none 



A pointer to the Exif IFD. Interoperability, Exif IFD has the same structure as that of the IFD 
specified in TIFF. Ordinarily, however, it does not contain image data as in the case of TIFF. 

B. GPS IFD 

GPS IFD is a set of tags for recording GPS information. It is pointed to by the offset from the TIFF 
header (Value Offset) indicated by a GPS private tag value. 

GPS Info IFD Pointer 



Tag = 34853 (8825.H) 

Type = LONG 

Count = 1 

Default = none 



A pointer to the GPS Info IFD. The Interoperability structure of the GPS Info IFD, like that of Exif 
IFD, has no image data. 

C. Interoperability IFD 

Interoperability IFD is composed of tags that store the information needed to ensure Interoperability,
and is pointed to by the following tag located in the Exif IFD.

Interoperability IFD Pointer 



Tag = 40965 (A005.H)

Type = LONG 

Count = 1 

Default = None 



The Interoperability structure of Interoperability IFD is same as TIFF defined IFD structure but does 
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not contain the image data characteristically compared with normal TIFF IFD. 
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2. 6.4. TIFF Rev. 6.0 Attribute Information 

Table 3 lists the attribute information used in Exif, including the attributes given as mandatory in 
Baseline TIFF Rev. 6.0 RGB Full Color Images and TIFF Rev. 6.0 Extensions YCbCr Images, as well 
as the optional TIFF tags used by DSC and other systems. The contents are explained below. 
Table 3 TIFF Rev. 6.0 Attribute Information Used in Exif

Tag Name                                        Field Name                   Tag ID  Tag ID  Type           Count
                                                                             (Dec)   (Hex)

A. Tags relating to image data structure
Image width                                     ImageWidth                   256     100     SHORT or LONG  1
Image height                                    ImageLength                  257     101     SHORT or LONG  1
Number of bits per component                    BitsPerSample                258     102     SHORT          3
Compression scheme                              Compression                  259     103     SHORT          1
Pixel composition                               PhotometricInterpretation    262     106     SHORT          1
Orientation of image                            Orientation                  274     112     SHORT          1
Number of components                            SamplesPerPixel              277     115     SHORT          1
Image data arrangement                          PlanarConfiguration          284     11C     SHORT          1
Subsampling ratio of Y to C                     YCbCrSubSampling             530     212     SHORT          2
Y and C positioning                             YCbCrPositioning             531     213     SHORT          1
Image resolution in width direction             XResolution                  282     11A     RATIONAL       1
Image resolution in height direction            YResolution                  283     11B     RATIONAL       1
Unit of X and Y resolution                      ResolutionUnit               296     128     SHORT          1

B. Tags relating to recording offset
Image data location                             StripOffsets                 273     111     SHORT or LONG  *S
Number of rows per strip                        RowsPerStrip                 278     116     SHORT or LONG  1
Bytes per compressed strip                      StripByteCounts              279     117     SHORT or LONG  *S
Offset to JPEG SOI                              JPEGInterchangeFormat        513     201     LONG           1
Bytes of JPEG data                              JPEGInterchangeFormatLength  514     202     LONG           1

C. Tags relating to image data characteristics
Transfer function                               TransferFunction             301     12D     SHORT          3*256
White point chromaticity                        WhitePoint                   318     13E     RATIONAL       2
Chromaticities of primaries                     PrimaryChromaticities        319     13F     RATIONAL       6
Color space transformation matrix coefficients  YCbCrCoefficients            529     211     RATIONAL       3
Pair of black and white reference values        ReferenceBlackWhite          532     214     RATIONAL       6

D. Other tags
File change date and time                       DateTime                     306     132     ASCII          20
Image title                                     ImageDescription             270     10E     ASCII          Any
Image input equipment manufacturer              Make                         271     10F     ASCII          Any
Image input equipment model                     Model                        272     110     ASCII          Any
Software used                                   Software                     305     131     ASCII          Any
Person who created the image                    Artist                       315     13B     ASCII          Any
Copyright holder                                Copyright                    33432   8298    ASCII          Any

*S  Chunky format: StripsPerImage

    Planar format: SamplesPerPixel * StripsPerImage

    StripsPerImage = floor((ImageLength + RowsPerStrip - 1) / RowsPerStrip)
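
The strip count formula is an integer ceiling division, which can be sketched as below. The function names are hypothetical; planar_configuration takes the values 1 (chunky) and 2 (planar) used by the PlanarConfiguration tag, and the default of 3 samples per pixel follows the value this standard sets for SamplesPerPixel.

```python
def strips_per_image(image_length: int, rows_per_strip: int) -> int:
    """floor((ImageLength + RowsPerStrip - 1) / RowsPerStrip)."""
    return (image_length + rows_per_strip - 1) // rows_per_strip


def strip_entry_count(image_length: int, rows_per_strip: int,
                      samples_per_pixel: int = 3,
                      planar_configuration: int = 1) -> int:
    """Count of StripOffsets / StripByteCounts values (the *S entries)."""
    strips = strips_per_image(image_length, rows_per_strip)
    if planar_configuration == 1:   # chunky format
        return strips
    if planar_configuration == 2:   # planar format
        return samples_per_pixel * strips
    raise ValueError("reserved PlanarConfiguration value")
```

A 481-row image with 16 rows per strip needs 31 strips, because the last strip holds the single remaining row.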
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A. Tags relating to image data structure 
ImageWidth 

The number of columns of image data, equal to the number of pixels per row. In JPEG compressed 
data a JPEG marker is used instead of this tag. 

Tag = 256 (100.H)

Type = SHORT or LONG 

Count = 1 

Default = none 

ImageLength 

The number of rows of image data. In JPEG compressed data a JPEG marker is used instead of this 
tag. 

Tag = 257(101.H) 

Type = SHORT or LONG 

Count = 1 

Default = none 

BitsPerSample 

The number of bits per image component. In this standard each component of the image is 8 bits, so
the value for this tag is 8. See also SamplesPerPixel. In JPEG compressed data a JPEG marker is
used instead of this tag.

Tag = 258(102.H) 

Type = SHORT 

Count = 3 

Default = 8 8 8

Compression 

The compression scheme used for the image data. When a primary image is JPEG compressed, this 
designation is not necessary and is omitted. When thumbnails use JPEG compression, this tag value 

is set to 6. 

Tag = 259(103.H) 

Type = SHORT 

Count = 1 

Default = none 

1 = uncompressed 

6 = JPEG compression (thumbnails only) 
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Other = reserved 



PhotometricInterpretation

The pixel composition. In JPEG compressed data a JPEG marker is used instead of this tag. 
Tag = 262(106.H) 

Type = SHORT 
Count = 1 
Default = none 

2 = RGB 

6 = YCbCr 

Other = reserved 

Orientation 

The image orientation viewed in terms of rows and columns. 
Tag = 274(112.H) 

Type = SHORT 
Count = 1 
Default = 1 

1 = The 0th row is at the visual top of the image, and the 0th column is the visual left-hand side.

2 = The 0th row is at the visual top of the image, and the 0th column is the visual right-hand side.

3 = The 0th row is at the visual bottom of the image, and the 0th column is the visual right-hand
side.

4 = The 0th row is at the visual bottom of the image, and the 0th column is the visual left-hand side.

5 = The 0th row is the visual left-hand side of the image, and the 0th column is the visual top.

6 = The 0th row is the visual right-hand side of the image, and the 0th column is the visual top.

7 = The 0th row is the visual right-hand side of the image, and the 0th column is the visual
bottom.

8 = The 0th row is the visual left-hand side of the image, and the 0th column is the visual
bottom.

Other = reserved 

SamplesPerPixel 

The number of components per pixel. Since this standard applies to RGB and YCbCr images, the 
value set for this tag is 3. In JPEG compressed data a JPEG marker is used instead of this tag. 

Tag = 277 (115.H)

Type = SHORT 

Count = 1 
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Default = 3 



PlanarConfiguration

Indicates whether pixel components are recorded in chunky or planar format. In JPEG compressed 
files a JPEG marker is used instead of this tag. If this field does not exist, the TIFF default of 1 
(chunky) is assumed. 

Tag = 284(11C.H) 

Type = SHORT 

Count = 1 

1 = chunky format 

2 = planar format 
Other = reserved 

YCbCrSubSampling 

The sampling ratio of chrominance components in relation to the luminance component. In JPEG 
compressed data a JPEG marker is used instead of this tag. 
Tag = 530(212.H) 

Type = SHORT 
Count = 2

[2, 1] = YCbCr4:2:2
[2, 2] = YCbCr4:2:0
Other = reserved 

YCbCrPositioning 

The position of chrominance components in relation to the luminance component. This field is 
designated only for JPEG compressed data or uncompressed YCbCr data. The TIFF default is 1 
(centered); but when Y:Cb:Cr = 4:2:2 it is recommended in this standard that 2 (co-sited) be used to 
record data, in order to improve the image quality when viewed on TV systems. When this field does 
not exist, the reader shall assume the TIFF default. In the case of Y:Cb:Cr = 4:2:0, the TIFF default 
(centered) is recommended. If the reader does not have the capability of supporting both kinds of 
YCbCrPositioning, it shall follow the TIFF default regardless of the value in this field. It is preferable 
that readers be able to support both centered and co-sited positioning. 

Tag = 531 (213.H) 

Type = SHORT 

Count =1 

Default =1 

1 = centered 
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2 = co-sited 

Other = reserved 



[Fig. 8 YCbCrPositioning: a) Y:Cb:Cr = 4:2:2 and b) Y:Cb:Cr = 4:2:0, each shown for
YCbCrPositioning = 1 (centered) and YCbCrPositioning = 2 (co-sited).
X = Luminance Sample, O = Chrominance Sample.]
XResolution 

The number of pixels per ResolutionUnit in the ImageWidth direction. When the image resolution is 
unknown, 72 [dpi] is designated. 

Tag = 282(11A.H) 

Type = RATIONAL 

Count = 1 

Default =72 

YResolution 

The number of pixels per ResolutionUnit in the ImageLength direction. The same value as 
XResolution is designated. 

Tag = 283(11B.H) 

Type = RATIONAL 

Count =1 

Default =72 



ResolutionUnit

The unit for measuring XResolution and YResolution. The same unit is used for both XResolution and
YResolution. If the image resolution is unknown, 2 (inches) is designated.

Tag = 296(128.H) 

Type = SHORT 

Count = 1 

Default = 2 

2 = inches 

3 = centimeters 
Other = reserved 
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B. Tags relating to recording offset 



StripOffsets

For each strip, the byte offset of that strip. It is recommended that this be selected so the number of
strip bytes does not exceed 64 Kbytes. With JPEG compressed data this designation is not needed and
is omitted. See also RowsPerStrip and StripByteCounts.

Tag = 273 (111.H)

Type = SHORT or LONG

Count = StripsPerImage (when PlanarConfiguration = 1)

= SamplesPerPixel * StripsPerImage (when PlanarConfiguration = 2)
Default = none

RowsPerStrip 

The number of rows per strip. This is the number of rows in the image of one strip when an image is 
divided into strips. With JPEG compressed data this designation is not needed and is omitted. See 
also RowsPerStrip and StripByteCounts. 

Tag = 278 (116.H)

Type = SHORT or LONG 

Count = 1 

Default = none 

StripByteCounts 

The total number of bytes in each strip. With JPEG compressed data this designation is not needed 
and is omitted. 

Tag = 279(117.H) 

Type = SHORT or LONG 

Count = StripsPerImage (when PlanarConfiguration = 1)

= SamplesPerPixel * StripsPerImage (when PlanarConfiguration = 2)
Default = none

JPEGInterchangeFormat

The offset to the start byte (SOI) of JPEG compressed thumbnail data. This is not used for primary 

image JPEG data. 

Tag = 513 (201.H)

Type = LONG 
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Default = none 



JPEGInterchangeFormatLength

The number of bytes of JPEG compressed thumbnail data. This is not used for primary image JPEG 
data. JPEG thumbnails are not divided but are recorded as a continuous JPEG bitstream from SOI to 
EOI. APPn and COM markers should not be recorded. Compressed thumbnails must be recorded in 
no more than 64 Kbytes, including all other data to be recorded in APP1. 

Tag = 514(202.H) 

Type = LONG 

Default = none 
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C. Tags Relating to Image Data Characteristics 
TransferFunction 

A transfer function for the image, described in tabular style. Normally this tag is not necessary, since 
color space is specified in the color space information tag (ColorSpace). 

Tag = 301 (12D.H) 

Type = SHORT 

Count = 3*256 

Default = none 

WhitePoint 

The chromaticity of the white point of the image. Normally this tag is not necessary, since color space
is specified in the color space information tag (ColorSpace).

Tag = 318(13E.H) 

Type = RATIONAL 

Count = 2 

Default = none 

PrimaryChromaticities

The chromaticity of the three primary colors of the image. Normally this tag is not necessary, since 
color space is specified in the color space information tag (ColorSpace). 

Tag = 319(13F.H) 

Type = RATIONAL 

Count =6 

Default = none 

YCbCrCoefficients

The matrix coefficients for transformation from RGB to YCbCr image data. No default is given in 
TIFF; but here the value given in Appendix E, "Color Space Guidelines," is used as the default. The 
color space is declared in a color space information tag, with the default being the value that gives the 
optimal image characteristics Interoperability this condition. 

Tag = 529 (211.H)

Type = RATIONAL 

Count =3 

Default = See Appendix E. 
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ReferenceBlackWhite 

The reference black point value and reference white point value. No defaults are given in TIFF, but 
the values below are given as defaults here. The color space is declared in a color space information 
tag, with the default being the value that gives the optimal image characteristics Interoperability 

these conditions. 

Tag = 532(214.H) 

Type = RATIONAL 

Count =6 

Default = [0, 255, 0, 255, 0, 255] (when PhotometricInterpretation is RGB)

[0, 255, 0, 128, 0, 128] (when PhotometricInterpretation is YCbCr)
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D. Other Tags 
DateTime 

The date and time of image creation. In this standard it is the date and time the file was changed. The
format is "YYYY:MM:DD HH:MM:SS" with time shown in 24-hour format, and the date and time
separated by one blank character [20.H]. When the date and time are unknown, all the character
spaces except colons (":") may be filled with blank characters, or else the Interoperability field may be
filled with blank characters. The character string length is 20 bytes including NULL for termination.
When the field is left blank, it is treated as unknown.

Tag = 306(132.H) 

Type = ASCII 

Count = 20 

Default = none 
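
A reader might handle the DateTime format, including the "unknown" convention, along the following lines. This is a sketch: parse_exif_datetime is a hypothetical name, and returning None for an unknown date is a choice made for the example.

```python
from datetime import datetime
from typing import Optional


def parse_exif_datetime(raw: bytes) -> Optional[datetime]:
    """Decode a 20-byte DateTime value; return None when it is unknown."""
    if len(raw) != 20 or raw[-1:] != b"\x00":
        raise ValueError("DateTime must be 20 bytes including the terminating NULL")
    text = raw[:-1].decode("ascii")
    # Unknown: everything except the colons is blank, or the whole field is blank.
    if text.replace(":", "").strip() == "":
        return None
    return datetime.strptime(text, "%Y:%m:%d %H:%M:%S")
```

Both blank forms described above (colons kept, or the whole field blank) are treated as unknown.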

ImageDescription

A character string giving the title of the image. It may be a comment such as "1988 company picnic" or
the like. Two-byte character codes cannot be used. When a 2-byte code is necessary, the Exif Private
tag UserComment is to be used.

Tag = 270 (10E.H) 

Type = ASCII 

Count = Any 
Default = none 

Make 

The manufacturer of the recording equipment. This is the manufacturer of the DSC, scanner, video 
digitizer or other equipment that generated the image. When the field is left blank, it is treated as 

unknown. 

Tag = 271 (10F.H) 

Type = ASCII 

Count = Any 

Default = none 

Model 

The model name or model number of the equipment. This is the model name or number of the DSC,
scanner, video digitizer or other equipment that generated the image. When the field is left blank, it 
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is treated as unknown. 
Tag = 272(110.H) 

Type = ASCII 
Count = Any 
Default = none 

Software 

This tag records the name and version of the software or firmware of the camera or image input 
device used to generate the image. The detailed format is not specified, but it is recommended that 
the example shown below be followed. When the field is left blank, it is treated as unknown. 

Ex.) "Exif Software Version 1.00a"
Tag = 305 (131.H)

Type = ASCII 
Count = Any 
Default = none 

Artist 

This tag records the name of the camera owner, photographer or image creator. The detailed format is
not specified, but it is recommended that the information be written as in the example below for ease
of Interoperability. When the field is left blank, it is treated as unknown.

Ex.) "Camera owner, John Smith; Photographer, Michael Brown; Image creator, Ken James"
Tag = 315 (13B.H)
Type = ASCII 
Count = Any 
Default = none 

Copyright 

Copyright information. In this standard the tag is used to indicate both the photographer and editor 
copyrights. It is the copyright notice of the person or organization claiming rights to the image. The 
Interoperability copyright statement including date and rights should be written in this field; e.g., 
"Copyright, John Smith, 19xx. All rights reserved." In this standard the field records both the 
photographer and editor copyrights, with each recorded in a separate part of the statement. When 
there is a clear distinction between the photographer and editor copyrights, these are to be written in 
the order of photographer followed by editor copyright, separated by NULL (in this case, since the
statement also ends with a NULL, there are two NULL codes) (see example 1). When only the 
photographer copyright is given, it is terminated by one NULL code (see example 2). When only the 
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editor copyright is given, the photographer copyright part consists of one space followed by a 
terminating NULL code, then the editor copyright is given (see example 3). When the field is left
blank, it is treated as unknown.

Ex. 1) When both the photographer copyright and editor copyright are given.

Photographer copyright + NULL[00.H] + editor copyright + NULL[00.H]
Ex. 2) When only the photographer copyright is given.

Photographer copyright + NULL[00.H]
Ex. 3) When only the editor copyright is given.

Space[20.H] + NULL[00.H] + editor copyright + NULL[00.H]

Tag = 33432 (8298.H) 

Type = ASCII 

Count = Any 

Default = none 
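
The three Copyright layouts can be told apart by splitting the field on NULL. The sketch below is illustrative: split_copyright is a hypothetical name, and using None for a part that is not given is an assumption of the example.

```python
def split_copyright(raw: bytes):
    """Return (photographer, editor); None marks a part that is not given."""
    parts = raw.split(b"\x00")
    # A well-formed field ends with NULL, so the last split piece is empty.
    if parts and parts[-1] == b"":
        parts.pop()
    photographer = parts[0].decode("ascii") if parts else ""
    editor = parts[1].decode("ascii") if len(parts) > 1 else None
    # A single space (or a blank field) means no photographer copyright.
    if photographer.strip() == "":
        photographer = None
    return photographer, editor
```

Example 3 above (a space, NULL, then the editor copyright) yields no photographer part and the editor string.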
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2.6.5. Exif IFD Attribute Information 

The attribute information (field names and codes) recorded in the Exif IFD is given in Table 4 and 
Table 5 followed by an explanation of the contents. 



Table 4 Exif IFD Attribute Information (1)

Tag Name                                   Field Name                    Tag ID  Tag ID  Type           Count
                                                                         (Dec)   (Hex)

A. Tags Relating to Version
Exif version                               ExifVersion                   36864   9000    UNDEFINED      4
Supported FlashPix version                 FlashPixVersion               40960   A000    UNDEFINED      4

B. Tag Relating to Image Data Characteristics
Color space information                    ColorSpace                    40961   A001    SHORT          1

C. Tags Relating to Image Configuration
Meaning of each component                  ComponentsConfiguration       37121   9101    UNDEFINED      4
Image compression mode                     CompressedBitsPerPixel        37122   9102    RATIONAL       1
Valid image width                          PixelXDimension               40962   A002    SHORT or LONG  1
Valid image height                         PixelYDimension               40963   A003    SHORT or LONG  1

D. Tags Relating to User Information
Manufacturer notes                         MakerNote                     37500   927C    UNDEFINED      Any
User comments                              UserComment                   37510   9286    UNDEFINED      Any

E. Tag Relating to Related File Information
Related audio file                         RelatedSoundFile              40964   A004    ASCII          13

F. Tags Relating to Date and Time
Date and time of original data generation  DateTimeOriginal              36867   9003    ASCII          20
Date and time of digital data generation   DateTimeDigitized             36868   9004    ASCII          20
DateTime subseconds                        SubSecTime                    37520   9290    ASCII          Any
DateTimeOriginal subseconds                SubSecTimeOriginal            37521   9291    ASCII          Any
DateTimeDigitized subseconds               SubSecTimeDigitized           37522   9292    ASCII          Any

G. Tags Relating to Picture-Taking Conditions
See Table 5

H. Tags Relating to Date and Time
Pointer of Interoperability IFD            Interoperability IFD Pointer  40965   A005    LONG           1
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Table 5 Exit IFD Attribute Information (2) 



Tag Name                        Field Name                Tag ID  Tag ID  Type       Count
                                                          (Dec)   (Hex)

G. Tags Relating to Picture-Taking Conditions
Exposure time                   ExposureTime              33434   829A    RATIONAL   1
F number                        FNumber                   33437   829D    RATIONAL   1
Exposure program                ExposureProgram           34850   8822    SHORT      1
Spectral sensitivity            SpectralSensitivity       34852   8824    ASCII      Any
ISO speed rating                ISOSpeedRatings           34855   8827    SHORT      Any
Optoelectric conversion factor  OECF                      34856   8828    UNDEFINED  Any
Shutter speed                   ShutterSpeedValue         37377   9201    SRATIONAL  1
Aperture                        ApertureValue             37378   9202    RATIONAL   1
Brightness                      BrightnessValue           37379   9203    SRATIONAL  1
Exposure bias                   ExposureBiasValue         37380   9204    SRATIONAL  1
Maximum lens aperture           MaxApertureValue          37381   9205    RATIONAL   1
Subject distance                SubjectDistance           37382   9206    RATIONAL   1
Metering mode                   MeteringMode              37383   9207    SHORT      1
Light source                    LightSource               37384   9208    SHORT      1
Flash                           Flash                     37385   9209    SHORT      1
Lens focal length               FocalLength               37386   920A    RATIONAL   1
Flash energy                    FlashEnergy               41483   A20B    RATIONAL   1
Spatial frequency response      SpatialFrequencyResponse  41484   A20C    UNDEFINED  Any
Focal plane X resolution        FocalPlaneXResolution     41486   A20E    RATIONAL   1
Focal plane Y resolution        FocalPlaneYResolution     41487   A20F    RATIONAL   1
Focal plane resolution unit     FocalPlaneResolutionUnit  41488   A210    SHORT      1
Subject location                SubjectLocation           41492   A214    SHORT      2
Exposure index                  ExposureIndex             41493   A215    RATIONAL   1
Sensing method                  SensingMethod             41495   A217    SHORT      1
File source                     FileSource                41728   A300    UNDEFINED  1
Scene type                      SceneType                 41729   A301    UNDEFINED  1
CFA pattern                     CFAPattern                41730   A302    UNDEFINED  Any



A. Tags Relating to Version 
ExifVersion 

The version of this standard supported. Nonexistence of this field is taken to mean nonconformance to 
the standard (see section 2.2). Conformance to this standard is indicated by recording "0210" as 4-byte 
ASCII. Since the type is UNDEFINED, there is no NULL for termination. 

Tag = 36864 (9000.H) 

Type = UNDEFINED 

Count = 4 

Default = "0210"

FlashPixVersion

The FlashPix format version supported by a FPXR file. If the FPXR function supports FlashPix 
format Ver. 1.0, this is indicated similarly to ExifVersion by recording "0100" as 4-byte ASCII. Since 
the type is UNDEFINED, there is no NULL for termination. 
Tag = 40960(A000.H) 

Type = UNDEFINED 

Count = 4 
Default = "0100"

0100 = FlashPix Format Version 1.0
Other = reserved 
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B. Tag Relating to Color Space 



ColorSpace 

The color space information tag (ColorSpace) is always recorded as the color space specifier.
Normally sRGB (=1) is used to define the color space based on the PC monitor conditions and
environment. If a color space other than sRGB is used, Uncalibrated (=FFFF.H) is set. Image data
recorded as Uncalibrated can be treated as sRGB when it is converted to FlashPix. On sRGB see
Appendix E.

Tag = 40961 (A001.H)

Type = SHORT 
Count = 1 

1 = sRGB 

FFFF.H = Uncalibrated 

Other = reserved 
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C. Tags Relating to Image Configuration 



PixelXDimension 

Information specific to compressed data. When a compressed file is recorded, the valid width of the 
meaningful image must be recorded in this tag, whether or not there is padding data or a restart 
marker. This tag should not exist in an uncompressed file. For details see section 2.8.1 and Appendix 
F. 

Tag = 40962 (A002.H) 

Type = SHORT or LONG 
Count =1 
Default =none 

PixelYDimension 

Information specific to compressed data. When a compressed file is recorded, the valid height of the 
meaningful image must be recorded in this tag, whether or not there is padding data or a restart 
marker. This tag should not exist in an uncompressed file. For details see section 2.8.1 and Appendix F. 
Since data padding is unnecessary in the vertical direction, the number of lines recorded in this valid 
image height tag will in fact be the same as that recorded in the SOF. 

Tag = 40963 (A003.H) 

Type = SHORT or LONG

Count = 1 

ComponentsConfiguration

Information specific to compressed data. The channels of each component are arranged in order from 
the 1st component to the 4th. For uncompressed data the data arrangement is given in the 
Photometriclnterpretation tag. However, since Photometriclnterpretation can only express the order 
of Y,Cb and Cr, this tag is provided for cases when compressed data uses components other than Y, Cb, 
and Cr and to enable support of other sequences. 

Tag = 37121 (9101. H) 

Type = UNDEFINED 

Count = 4 

Default = 4 5 6 0 (if RGB uncompressed) 
1 2 3 0 (other cases) 

0 = does not exist

1 = Y 

2 = Cb 

3 = Cr 
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4 = R 

5 = G 

6 = B 
Other = reserved 
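
The component codes listed above can be decoded as in the following minimal sketch. The function name and the decision to skip code 0 and label unknown codes "reserved" are illustrative assumptions, not part of the specification.

```python
# Component codes as listed for ComponentsConfiguration.
COMPONENT_NAMES = {1: "Y", 2: "Cb", 3: "Cr", 4: "R", 5: "G", 6: "B"}


def decode_components_configuration(value: bytes) -> list[str]:
    """Return the component names in recorded order (hypothetical helper)."""
    if len(value) != 4:
        raise ValueError("ComponentsConfiguration must hold exactly 4 bytes")
    names = []
    for code in value:
        if code == 0:  # 0 = component does not exist
            continue
        names.append(COMPONENT_NAMES.get(code, "reserved"))
    return names


print(decode_components_configuration(bytes([1, 2, 3, 0])))
```

The two default values given above, 4 5 6 0 and 1 2 3 0, decode to R, G, B and Y, Cb, Cr respectively.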

CompressedBitsPerPixel

Information specific to compressed data. The compression mode used for a compressed image is
indicated in unit bits per pixel.

Tag = 37122 (9102.H)

Type = RATIONAL 

Count =1 

Default = none 
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D. Tags Relating to User Information 
MakerNote 

A tag for manufacturers of Exif writers to record any desired information. The contents are up to the
manufacturer.

Tag = 37500 (927C.H) 

Type = UNDEFINED 

Count = Any 

Default = none 



UserComment 

A tag for Exif users to write keywords or comments on the image besides those in ImageDescription,
and without the character code limitations of the ImageDescription tag.

Tag = 37510 (9286.H) 

Type = UNDEFINED 

Count = Any 

Default = none 

The character code used in the UserComment tag is identified based on an ID code in a fixed 8-byte
area at the start of the tag data area. The unused portion of the area is padded with NULL ("00.H").
ID codes are assigned by means of registration. The designation method and references for each
character code are given in Table 6. The value of Count N is determined based on the 8 bytes in the
character code area and the number of bytes in the user comment part. Since the TYPE is not ASCII,
NULL termination is not necessary (see Fig. 9).

Table 6 Character Codes and their Designation

Character Code   Code Designation (8 Bytes)                        References
ASCII            41.H, 53.H, 43.H, 49.H, 49.H, 00.H, 00.H, 00.H    ITU-T T.50 IA5 x
JIS              4A.H, 49.H, 53.H, 00.H, 00.H, 00.H, 00.H, 00.H    JIS X0208-1990 xi
Unicode          55.H, 4E.H, 49.H, 43.H, 4F.H, 44.H, 45.H, 00.H    Unicode Standard xii
Undefined        00.H, 00.H, 00.H, 00.H, 00.H, 00.H, 00.H, 00.H    Undefined
[Figure: the UserComment entry among the Exif private tags of the Exif IFD (after ExifVersion); the value of the Exif IFD holds the character code area followed by the user comment column.]

Fig. 9 User Comment Tag

The ID code for the UserComment area may be a Defined code such as JIS or ASCII, or may be 
Undefined. The Undefined name is UndefinedText, and the ID code is filled with 8 bytes of all 
"NULL" ("00.H"). An Exif reader that reads the UserComment tag must have a function for 
determining the ID code. This function is not required in Exif readers that do not use the 
UserComment tag (see Table 7). 



Table 7 Implementation of Defined and Undefined Character Codes

ID Code              Exif Reader Implementation
Defined              Determines the ID code and displays it in accord with the reader capability.
(JIS, ASCII, etc.)
Undefined            Depends on the localized PC in each country. (If a character code is used for
(all NULL)           which there is no clear specification like Shift-JIS in Japan, Undefined is used.)
                     Although the possibility of unreadable characters exists, display of these
                     characters is left as a matter of reader implementation.



When a UserComment area is set aside, it is recommended that the ID code be ASCII and that the 
following user comment part be filled with blank characters [20.H]. 
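
As an illustration of the ID-code check an Exif reader needs, the sketch below splits a UserComment value into its character code and comment part. The function name, the returned labels and the "unregistered" fallback are assumptions made for this example; only the 8-byte designations come from Table 6.

```python
# 8-byte ID codes from Table 6, padded with NULL (00.H).
CHARACTER_CODES = {
    b"ASCII\x00\x00\x00": "ascii",
    b"JIS\x00\x00\x00\x00\x00": "jis",
    b"UNICODE\x00": "unicode",
    b"\x00" * 8: "undefined",
}


def split_user_comment(value: bytes) -> tuple[str, bytes]:
    """Return (character code label, raw comment bytes); hypothetical helper."""
    if len(value) < 8:
        raise ValueError("UserComment needs at least the 8-byte ID code")
    code = CHARACTER_CODES.get(value[:8], "unregistered")
    # The comment part is not NULL terminated because the type is UNDEFINED.
    return code, value[8:]


code, comment = split_user_comment(b"ASCII\x00\x00\x00Sunset   ")
print(code, comment)
```

The comment bytes are returned undecoded so that the reader can decide how to display them, in line with Table 7.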
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E. Tag Relating to Related File 
RelatedSoundFile 

This tag is used to record the name of an audio file related to the image data. The only relational
information recorded here is the Exif audio file name and extension (an ASCII string consisting of 8
characters + '.' + 3 characters). The path is not recorded. Stipulations on audio are given in section
3.6.3. File naming conventions are given in section 3.7.1.

When using this tag, audio files must be recorded in conformance to the Exif audio format. Writers
are also allowed to store data such as audio within APP2 as FlashPix extension stream data.

The mapping of Exif image files and audio files is done in any of the three ways shown in Table 8. If
multiple files are mapped to one file as in [2] or [3] of this table, the above format is used to record just
one audio file name. If there are multiple audio files, the first recorded file is given.
In the case of [3] in Table 8, for example, for the Exif image file "DSC00001.JPG" only
"SND00001.WAV" is given as the related Exif audio file.

When there are three Exif audio files "SND00001.WAV", "SND00002.WAV" and "SND00003.WAV",
the Exif image file name for each of them, "DSC00001.JPG", is indicated. By combining multiple
relational information, a variety of playback possibilities can be supported. The method of using
relational information is left to the implementation on the playback side. Since this information is an
ASCII character string, it is terminated by NULL.

Table 8 Mapping between Image and Audio Files 




When this tag is used to map audio files, the relation of the audio file to image data must also be 
indicated on the audio file end. 

Tag = 40964 (A004.H) 

Type = ASCII 
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Count = 13 
Default = none 
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F. Tags Relating to Date and Time 
DateTimeOriginal 

The date and time when the original image data was generated. For a DSC the date and time the
picture was taken are recorded. The format is "YYYY:MM:DD HH:MM:SS" with time shown in 24-hour
format, and the date and time separated by one blank character [20.H]. When the date and time
are unknown, all the character spaces except colons (":") may be filled with blank characters, or else
the Interoperability field may be filled with blank characters. The character string length is 20 bytes
including NULL for termination. When the field is left blank, it is treated as unknown.

Tag = 36867 (9003.H) 

Type = ASCII 

Count =20 

Default = none 
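
A reader might parse this 20-byte string as in the sketch below. The function name and the choice to return None for an unknown (blank) value are assumptions for illustration.

```python
from datetime import datetime
from typing import Optional


def parse_exif_datetime(raw: str) -> Optional[datetime]:
    """Parse "YYYY:MM:DD HH:MM:SS"; return None when the field is blank."""
    text = raw.rstrip("\x00")  # drop the terminating NULL if still present
    # Unknown values are blank everywhere except, possibly, the colons.
    if not text.replace(":", "").strip():
        return None
    return datetime.strptime(text, "%Y:%m:%d %H:%M:%S")


print(parse_exif_datetime("1998:09:09 09:15:30\x00"))
print(parse_exif_datetime("    :  :     :  :  \x00"))
```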

DateTimeDigitized

The date and time when the image was stored as digital data. If, for example, an image was captured
by DSC and at the same time the file was recorded, then the DateTimeOriginal and
DateTimeDigitized will have the same contents. The format is "YYYY:MM:DD HH:MM:SS" with time
shown in 24-hour format, and the date and time separated by one blank character [20.H]. When the
date and time are unknown, all the character spaces except colons (":") may be filled with blank
characters, or else the Interoperability field may be filled with blank characters. The character string
length is 20 bytes including NULL for termination. When the field is left blank, it is treated as
unknown.

Tag = 36868 (9004.H) 

Type = ASCII 

Count = 20 

Default = none 

SubsecTime 

A tag used to record fractions of seconds for the DateTime tag. 
Tag = 37520 (9290.H) 

Type = ASCII 
Count = Any 
Default = none 
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SubsecTimeOriginal 

A tag used to record fractions of seconds for the DateTimeOriginal tag. 
Tag = 37521 (9291.H)

Type = ASCII 

N = Any 

Default = none 


SubsecTimeDigitized 

A tag used to record fractions of seconds for the DateTimeDigitized tag. 
Tag = 37522 (9292.H) 

Type = ASCII 

N = Any 

Default = none 

Note: Recording subsecond data (SubsecTime, SubsecTimeOriginal, SubsecTimeDigitized)
The tag type is ASCII and the string length including NULL is variable length. When the number of
valid digits is up to the second decimal place, the subsecond value goes in the Value position. When it
is up to four decimal places, an address value is Interoperability, with the subsecond value put in the
location pointed to by that address. (Since the count of ASCII type field Interoperability is a value that
includes NULL, when the number of valid digits is up to four decimal places the count is 5, and the
offset value goes in the Value Offset field. See section 2.6.2.) Note that the subsecond tag differs from
the DateTime tag and other such tags already defined in TIFF Rev. 6.0, and that both are recorded in
the Exif IFD.

Ex.: September 9, 1998, 9:15:30.130
(the number of valid digits is up to the third decimal place)
DateTime 1996:09:01 09:15:30 [NULL]
SubSecTime 130 [NULL]

If the string length is longer than the number of valid digits, the digits are aligned with the start of
the area and the rest is filled with blank characters [20.H]. If the subsecond data is unknown, the
Interoperability area can be filled with blank characters.
Examples when subsecond data is 0.130 seconds:

Ex. 1) '1', '3', '0', [NULL]

Ex. 2) '1', '3', '0', [20.H], [NULL]

Ex. 3) '1', '3', '0', [20.H], [20.H], [20.H], [20.H], [20.H], [NULL]
Example when subsecond data is unknown:

Ex. 4) [20.H], [20.H], [20.H], [20.H], [20.H], [20.H], [20.H], [20.H], [NULL]
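
The padding rules in the note above can be handled as in this sketch, which turns a subsecond string into microseconds. The function name and the microsecond resolution are assumptions for illustration; digits beyond the sixth are simply truncated.

```python
from typing import Optional


def subsec_to_microseconds(raw: str) -> Optional[int]:
    """Convert a SubsecTime-style string to microseconds, or None if unknown."""
    digits = raw.rstrip("\x00").strip(" ")  # drop NULL and blank padding
    if not digits:
        return None  # area filled with blanks: subsecond data unknown
    if not digits.isdigit():
        raise ValueError(f"unexpected subsecond value: {raw!r}")
    # "130" means 0.130 s, so pad on the right to six decimal places.
    return int(digits[:6].ljust(6, "0"))


print(subsec_to_microseconds("130\x00"))
print(subsec_to_microseconds("130     \x00"))
print(subsec_to_microseconds("        \x00"))
```

Ex. 1 to Ex. 3 above all yield the same value, and Ex. 4 yields None.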
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G. Tags Relating to Picture-Taking Conditions 
ExposureTime 

Exposure time, given in seconds (sec). 
Tag = 33434 (829A.H) 

Type = RATIONAL 

Count =1 
Default = none 

ShutterSpeedValue 

Shutter speed. The unit is the APEX (Additive System of Photographic Exposure) setting (see
Appendix C).

Tag = 37377 (9201.H)

Type = SRATIONAL

Count = 1 

Default = none 

ApertureValue 

The lens aperture. The unit is the APEX value. 
Tag = 37378 (9202.H) 

Type = RATIONAL 

Count = 1 
Default = none 
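
For readers who need ordinary units, the sketch below converts the APEX values stored in ShutterSpeedValue and ApertureValue. It assumes the standard APEX relations Tv = -log2(exposure time) and Av = 2 log2(F number); the function names are illustrative.

```python
def exposure_time_from_apex(tv: float) -> float:
    """Exposure time in seconds for an APEX shutter speed value Tv."""
    return 2.0 ** (-tv)


def f_number_from_apex(av: float) -> float:
    """F number for an APEX aperture value Av."""
    return 2.0 ** (av / 2.0)


print(exposure_time_from_apex(5.0))  # Tv = 5
print(f_number_from_apex(4.0))       # Av = 4
```

Under these relations a shutter speed value of 5 corresponds to 1/32 second and an aperture value of 4 corresponds to F4.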

BrightnessValue

The value of brightness. The unit is the APEX value. Ordinarily it is given in the range of -99.99 to
99.99.

Tag = 37379 (9203.H)

Type = SRATIONAL

Count =1 
Default = none 

ExposureBiasValue 

The exposure bias. The unit is the APEX value. Ordinarily it is given in the range of -99.99 to 99.99. 
Tag = 37380 (9204.H) 

Type = SRATIONAL
Count = 1
Default = none 

MaxApertureValue 

The smallest F number of the lens. The unit is the APEX value. Ordinarily it is given in the range of 
00.00 to 99.99, but it is not limited to this range. 

Tag = 37381 (9205.H) 

Type = RATIONAL 

Count =1 

Default = none 

SubjectDistance 

The distance to the subject, given in meters. 
Tag = 37382 (9206.H) 

Type = RATIONAL 
Count =1 
Default = none 

MeteringMode 

The metering mode. 

Tag = 37383 (9207.H)

Type = SHORT
Count = 1
Default = 0

0 = unknown
1 = Average
2 = CenterWeightedAverage
3 = Spot
4 = MultiSpot
5 = Pattern
6 = Partial
7 to 254 = reserved
255 = other



LightSource

The kind of light source.

Tag = 37384 (9208.H)

Type = SHORT

Count = 1 

Default = 0 

0 = unknown 

1 = Daylight 

2 = Fluorescent 

3 = Tungsten 

17 = Standard light A 

18 = Standard light B 
19 = Standard light C

20 = D55 

21 = D65 

22 = D75 

23 to 254 = reserved 
255 = other 



Flash

This tag is recorded when an image is taken using a strobe light (flash). Bit 0 indicates the flash firing
status, and bits 1 and 2 indicate the flash return status (see Fig. 10).

[Figure: bit layout of the Flash tag from MSB to LSB; bit 0 is the flash firing status and bits 1 and 2 are the flash return status.]

Fig. 10 Bit Coding of the Flash Tag

Tag = 37385 (9209.H) 

Type = SHORT 

Count = 1 

Values for bit 0 indicating whether the flash fired. 

0b = Flash did not fire.

1b = Flash fired. 

Values for bits 1 and 2 indicating the status of returned light. 

00b = No strobe return detection function 

01b = reserved 

10b = Strobe return light not detected. 
11b = Strobe return light detected. 
Resulting Flash tag values.

0000.H = Flash did not fire.

0001.H = Flash fired.

0005.H = Strobe return light not detected.
0007.H = Strobe return light detected.
Other = reserved 
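
The bit coding can be unpacked as in the following sketch. The function name and the status strings are illustrative; the bit positions are those described for Fig. 10.

```python
# Meaning of bits 1 and 2 (flash return status).
RETURN_STATUS = {
    0b00: "no strobe return detection function",
    0b01: "reserved",
    0b10: "strobe return light not detected",
    0b11: "strobe return light detected",
}


def decode_flash(value: int) -> tuple[bool, str]:
    """Split a Flash tag value into (fired, return status); hypothetical helper."""
    fired = bool(value & 0x1)                    # bit 0: flash firing status
    status = RETURN_STATUS[(value >> 1) & 0x3]   # bits 1-2: return status
    return fired, status


for tag_value in (0x0000, 0x0001, 0x0005, 0x0007):
    print(f"{tag_value:04X}.H", decode_flash(tag_value))
```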

FocalLength 

The actual focal length of the lens, in mm. Conversion is not made to the focal length of a 35 mm film
camera.

Tag = 37386 (920A.H)
Type = RATIONAL
Count = 1
Default = none



FNumber 

The F number. 

Tag = 33437 (829D.H) 

Type = RATIONAL 
Count = 1 
Default = none 

ExposureProgram

The class of the program used by the camera to set exposure when the picture is taken. The tag values 
are as follows. 

Tag = 34850 (8822.H) 

Type = SHORT 

Count = 1 

Default = 0 

0 = Not defined 

1 = Manual 

2 = Normal program 

3 = Aperture priority 

4 = Shutter priority 

5 = Creative program (biased toward depth of field) 

6 = Action program (biased toward fast shutter speed) 

7 = Portrait mode (for closeup photos with the background out of focus)

8 = Landscape mode (for landscape photos with the background in focus)

9 to 255 = reserved

SpectralSensitivity 

Indicates the spectral sensitivity of each channel of the camera used. The tag value is an ASCII string
compatible with the standard xiii developed by the ASTM Technical committee.

Tag = 34852 (8824.H) 

Type = ASCII 

Count = Any 

Default = none 

ISOSpeedRatings 

Indicates the ISO Speed and ISO Latitude of the camera or input device as specified in ISO 12232 xiv.
Tag = 34855 (8827.H) 

Type = SHORT 

Count = Any 
Default = none 

OECF 

Indicates the Opto-Electronic Conversion Function (OECF) specified in ISO 14524 xv. OECF is the
relationship between the camera optical input and the image values.

Tag = 34856 (8828.H) 

Type = UNDEFINED 

Count = ANY 

Default = none 

When this tag records an OECF of m rows and n columns, the values are as in Fig. 11.

Length   Type        Meaning
2        SHORT       Columns = n
2        SHORT       Rows = m
Any      ASCII       0th column item name (NULL terminated)
:        :           :
Any      ASCII       n-1th column item name (NULL terminated)
8        SRATIONAL   OECF value [0,0]
:        :           :
8        SRATIONAL   OECF value [n-1,0]
8        SRATIONAL   OECF value [0,m-1]
:        :           :
8        SRATIONAL   OECF value [n-1,m-1]

Fig. 11 OECF Description






Table 9 gives a simple example. 



Table 9 Example of Exposure and RGB Output Level

Camera log Aperture   R Output Level   G Output Level   B Output Level
-3.0                  10.2             12.4             8.9
-2.0                  48.1             47.5             48.3
-1.0                  150.2            152.0            149.8
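
The OECF layout of Fig. 11 can be read back as in the sketch below. The function name, the use of Fraction for SRATIONAL values, and the byte_order parameter are assumptions for illustration; the byte order would in practice follow the TIFF header of the file.

```python
import struct
from fractions import Fraction


def parse_oecf(blob: bytes, byte_order: str = "<"):
    """Return (column names, rows of values) from an OECF tag value."""
    n, m = struct.unpack_from(byte_order + "HH", blob, 0)  # columns, rows
    pos = 4
    names = []
    for _ in range(n):  # n NULL-terminated ASCII column names
        end = blob.index(b"\x00", pos)
        names.append(blob[pos:end].decode("ascii"))
        pos = end + 1
    rows = []
    for _ in range(m):  # values run [0,0] .. [n-1,0], then the next row
        row = []
        for _ in range(n):
            num, den = struct.unpack_from(byte_order + "ii", blob, pos)
            row.append(Fraction(num, den))
            pos += 8
        rows.append(row)
    return names, rows


# First row of Table 9, restricted to two columns for brevity.
sample = (struct.pack("<HH", 2, 1) + b"Exposure\x00R\x00"
          + struct.pack("<iiii", -30, 10, 102, 10))
print(parse_oecf(sample))
```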



FlashEnergy 

Indicates the strobe energy at the time the image is captured, as measured in Beam Candle Power 
Seconds (BCPS). 

Tag =41483 (A20B.H) 

Type = RATIONAL 

Count = 1 

Default = none 

SpatialFrequencyResponse 

This tag records the camera or input device spatial frequency table and SFR values in the direction of 
image width, image height, and diagonal direction, as specified in ISO 12233 xvi.

Tag = 41484 (A20C.H)

Type = UNDEFINED

Count = ANY

Default = none
When the spatial frequency response for m rows and n columns is recorded, the values are as shown 
in Fig. 12. 



Length   Type       Meaning
2        SHORT      Columns = n
2        SHORT      Rows = m
Any      ASCII      0th column item name (NULL terminated)
:        :          :
Any      ASCII      n-1th column item name (NULL terminated)
8        RATIONAL   SFR value [0,0]
:        :          :
8        RATIONAL   SFR value [n-1,0]
8        RATIONAL   SFR value [0,m-1]
:        :          :
8        RATIONAL   SFR value [n-1,m-1]

Fig. 12 Spatial Frequency Response Description
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Table 10 gives a simple example. 

Table 10 Example of Spatial Frequency Response 




FocalPlaneXResolution 

Indicates the number of pixels in the image width (X) direction per FocalPlaneResolutionUnit on the 

camera focal plane. 

Tag = 41486 (A20E.H)

Type = RATIONAL 
Count = 1 
Default = none 

FocalPlaneYResolution 

Indicates the number of pixels in the image height (Y) direction per FocalPlaneResolutionUnit on the 
camera focal plane. 
Tag = 41487 (A20F.H)
Type = RATIONAL
Count = 1
Default = none



FocalPlaneResolutionUnit 

Indicates the unit for measuring FocalPlaneXResolution and FocalPlaneYResolution. This value is 
the same as the ResolutionUnit. 

Tag = 41488 (A210.H) 

Type = SHORT 

Count =1 

Default = 2 (inch)

Note on use of tags concerning focal plane resolution 

These tags record the actual focal plane resolutions of the main image which is written as a file after 
processing instead of the pixel resolution of the image sensor in the camera. It should be noted 
carefully that the data from the image sensor is resampled. 

These tags are used at the same time as a FocalLength tag when the angle of field of the recorded 
image is to be calculated precisely. 
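
As a sketch of that calculation, the horizontal angle of field can be estimated from the image width in pixels, FocalPlaneXResolution, FocalPlaneResolutionUnit and FocalLength. The function name is hypothetical, and the unit table assumes the TIFF ResolutionUnit values 2 = inch and 3 = centimeter.

```python
import math

# Millimeters per resolution unit (assumed TIFF ResolutionUnit codes).
UNIT_MM = {2: 25.4, 3: 10.0}


def horizontal_angle_of_view(pixel_width: int, focal_plane_x_res: float,
                             resolution_unit: int, focal_length_mm: float) -> float:
    """Angle of field in degrees across the image width."""
    # Width of the recorded image on the focal plane, in millimeters.
    width_mm = pixel_width / focal_plane_x_res * UNIT_MM[resolution_unit]
    return math.degrees(2.0 * math.atan(width_mm / (2.0 * focal_length_mm)))


# 2540 pixels at 2540 pixels per inch is one inch wide; with a 12.7 mm lens
# the half-width equals the focal length.
print(horizontal_angle_of_view(2540, 2540.0, 2, 12.7))
```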
SubjectLocation

Indicates the location of the main subject in the scene. The value of this tag represents the pixel at 
the center of the main subject relative to the left edge, prior to rotation processing as per the Rotation 
tag. The first value indicates the X column number and second indicates the Y row number. 

Tag = 41492 (A214.H) 

Type = SHORT 

Count = 2 

Default = none 

ExposureIndex

Indicates the exposure index selected on the camera or input device at the time the image is captured.

Tag = 41493 (A215.H)
Type = RATIONAL
Count = 1
Default = none

SensingMethod 



Indicates the image sensor type on the camera or input device. The values are as follows. 



Tag = 41495 (A217.H)
Type = SHORT
Count = 1
Default = none

1 = Not defined
2 = One-chip color area sensor
3 = Two-chip color area sensor
4 = Three-chip color area sensor
5 = Color sequential area sensor
7 = Trilinear sensor
8 = Color sequential linear sensor
Other = reserved



FileSource 

Indicates the image source. If a DSC recorded the image, the value of this tag must always be set to 3,
indicating that the image was recorded on a DSC.

Tag = 41728 (A300.H) 

Type = UNDEFINED 

Count = 1 
Default = 3

3 = DSC 

Other = reserved 

SceneType 

Indicates the type of scene. If a DSC recorded the image, this tag value must always be set to 1,
indicating that the image was directly photographed.
Tag = 41729 (A301.H) 

Type = UNDEFINED 
Count = 1 
Default = 1 

1 = A directly photographed image 

Other = reserved 
CFAPattern 

Indicates the color filter array (CFA) geometric pattern of the image sensor when a one-chip color
area sensor is used. It does not apply to all sensing methods.

Tag = 41730 (A302.H) 

Type = UNDEFINED 

Count = ANY 

Fig. 13 shows how a CFA pattern is recorded for a one-chip color area sensor when the color filter 
array is repeated in m x n (vertical x lateral) pixel units. 



Length   Type    Meaning
2        SHORT   Horizontal repeat pixel unit = n
2        SHORT   Vertical repeat pixel unit = m
1        BYTE    CFA value [0,0]
:        :       :
1        BYTE    CFA value [n-1,0]
1        BYTE    CFA value [0,m-1]
:        :       :
1        BYTE    CFA value [n-1,m-1]

Fig. 13 CFA Pattern Description
The relation of color filter color to CFA value is shown in Table 11.

Table 11 Color Filter Color and CFA Value

Filter Color   CFA Value
RED            00.H
GREEN          01.H
BLUE           02.H
CYAN           03.H
MAGENTA        04.H
YELLOW         05.H
WHITE          06.H



For example, when the CFA pattern values are {0002.H, 0002.H, 01.H, 00.H, 02.H, 01.H}, the color 
filter array is as shown in Fig. 14. 



G R G R
B G B G
G R G R
B G B G

Fig. 14 Color Filter Array
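
The worked example can be checked with a short sketch that expands a CFAPattern value into its repeat unit. The function name and the byte_order parameter are assumptions for illustration; the byte order would in practice follow the TIFF header of the file.

```python
import struct

# Filter colors from Table 11.
CFA_COLORS = {0: "RED", 1: "GREEN", 2: "BLUE", 3: "CYAN",
              4: "MAGENTA", 5: "YELLOW", 6: "WHITE"}


def parse_cfa_pattern(blob: bytes, byte_order: str = ">"):
    """Return the m x n repeat unit as rows of color names."""
    n, m = struct.unpack_from(byte_order + "HH", blob, 0)  # horizontal, vertical
    cells = blob[4:4 + n * m]
    if len(cells) != n * m:
        raise ValueError("CFAPattern is shorter than its repeat unit")
    # Values are stored row by row: [0,0] .. [n-1,0], then the next row.
    return [[CFA_COLORS.get(cells[row * n + col], "reserved") for col in range(n)]
            for row in range(m)]


# {0002.H, 0002.H, 01.H, 00.H, 02.H, 01.H} from the example above.
print(parse_cfa_pattern(bytes([0x00, 0x02, 0x00, 0x02, 0x01, 0x00, 0x02, 0x01])))
```

The result is the 2 x 2 unit G R / B G, which tiles into the array of Fig. 14.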
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2.6.6. GPS Attribute Information 

The attribute information (field names and codes) recorded in the GPS Info IFD is given in Table 12,
followed by an explanation of the contents.

Table 12 GPS Attribute Information

Tag Name                                 Field Name            Tag ID       Type       Count
                                                               Dec   Hex
A. Tags Relating to GPS
GPS tag version                          GPSVersionID          0     0      BYTE       4
North or South Latitude                  GPSLatitudeRef        1     1      ASCII      2
Latitude                                 GPSLatitude           2     2      RATIONAL   3
East or West Longitude                   GPSLongitudeRef       3     3      ASCII      2
Longitude                                GPSLongitude          4     4      RATIONAL   3
Altitude reference                       GPSAltitudeRef        5     5      BYTE       1
Altitude                                 GPSAltitude           6     6      RATIONAL   1
GPS time (atomic clock)                  GPSTimeStamp          7     7      RATIONAL   3
GPS satellites used for measurement      GPSSatellites         8     8      ASCII      Any
GPS receiver status                      GPSStatus             9     9      ASCII      2
GPS measurement mode                     GPSMeasureMode        10    A      ASCII      2
Measurement precision                    GPSDOP                11    B      RATIONAL   1
Speed unit                               GPSSpeedRef           12    C      ASCII      2
Speed of GPS receiver                    GPSSpeed              13    D      RATIONAL   1
Reference for direction of movement      GPSTrackRef           14    E      ASCII      2
Direction of movement                    GPSTrack              15    F      RATIONAL   1
Reference for direction of image         GPSImgDirectionRef    16    10     ASCII      2
Direction of image                       GPSImgDirection       17    11     RATIONAL   1
Geodetic survey data used                GPSMapDatum           18    12     ASCII      Any
Reference for latitude of destination    GPSDestLatitudeRef    19    13     ASCII      2
Latitude of destination                  GPSDestLatitude       20    14     RATIONAL   3
Reference for longitude of destination   GPSDestLongitudeRef   21    15     ASCII      2
Longitude of destination                 GPSDestLongitude      22    16     RATIONAL   3
Reference for bearing of destination     GPSDestBearingRef     23    17     ASCII      2
Bearing of destination                   GPSDestBearing        24    18     RATIONAL   1
Reference for distance to destination    GPSDestDistanceRef    25    19     ASCII      2
Distance to destination                  GPSDestDistance       26    1A     RATIONAL   1
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A. Tags Relating to GPS 
GPSVersionID 

Indicates the version of GPSInfoIFD. The version is given as 2.0.0.0. This tag is mandatory when 
GPSInfo tag is present. (Note: The GPSVersionID tag is given in bytes, unlike the ExifVersion tag. 
When the version is 2.0.0.0, the tag value is 02000000.H.) 
Tag = 0(0.H) 

Type = BYTE 

Count = 4 
Default = 2.0.0.0 

2.0.0.0 = Version 2.0 
Other = reserved 

GPSLatitudeRef 

Indicates whether the latitude is north or south latitude. The ASCII value 'N' indicates north latitude,
and 'S' is south latitude.
Tag = 1 (1.H)

Type = ASCII

Count = 2
Default = none

'N' = North latitude

'S' = South latitude 

Other = reserved 

GPSLatitude 

Indicates the latitude. The latitude is expressed as three RATIONAL values giving the degrees, 
minutes, and seconds, respectively. When degrees, minutes and seconds are expressed, the format is 
dd/1,mm/1,ss/1. When degrees and minutes are used and, for example, fractions of minutes are given
up to two decimal places, the format is dd/1,mmmm/100,0/1.

Tag = 2(2.H) 

Type = RATIONAL 

Count = 3 

Default = none 

GPSLongitudeRef 

Indicates whether the longitude is east or west longitude. ASCII 'E' indicates east longitude, and 'W'
is west longitude.

Tag = 3 (3.H)

Type = ASCII

Count = 2

Default = none

'E' = East longitude

'W' = West longitude

Other = reserved 

GPSLongitude 

Indicates the longitude. The longitude is expressed as three RATIONAL values giving the degrees, 
minutes, and seconds, respectively. When degrees, minutes and seconds are expressed, the format is 
ddd/1,mm/1,ss/1. When degrees and minutes are used and, for example, fractions of minutes are given
up to two decimal places, the format is ddd/1,mmmm/100,0/1.

Tag = 4(4.H) 

Type = RATIONAL 

Count = 3 

Default = none 
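
Both RATIONAL layouts described for GPSLatitude and GPSLongitude reduce to the same decimal-degree calculation. The sketch below is illustrative; the function name and the (numerator, denominator) pair representation are assumptions.

```python
from fractions import Fraction


def gps_to_decimal_degrees(dms, ref: str) -> float:
    """Convert three (numerator, denominator) RATIONALs plus a Ref to degrees."""
    degrees, minutes, seconds = (Fraction(num, den) for num, den in dms)
    value = degrees + minutes / 60 + seconds / 3600
    if ref in ("S", "W"):       # south latitude and west longitude are negative
        value = -value
    elif ref not in ("N", "E"):
        raise ValueError(f"reserved reference value: {ref!r}")
    return float(value)


# dd/1,mmmm/100,0/1 form: 35 degrees 41.25 minutes north.
print(gps_to_decimal_degrees([(35, 1), (4125, 100), (0, 1)], "N"))
# ddd/1,mm/1,ss/1 form: 139 degrees 30 minutes 0 seconds west.
print(gps_to_decimal_degrees([(139, 1), (30, 1), (0, 1)], "W"))
```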

GPSAltitudeRef

Indicates the altitude used as the reference altitude. In this version the reference altitude is sea level, 
so this tag must be set to 0. The reference unit is meters. Note that this tag is BYTE type, unlike other 
reference tags. 

Tag = 5 (5.H)

Type = BYTE 
Count =1 
Default =0 

0 = Sea level 

Other = reserved 

GPSAltitude 

Indicates the altitude based on the reference in GPSAltitudeRef. Altitude is expressed as one 
RATIONAL value. The reference unit is meters. 

Tag = 6(6.H) 

Type = RATIONAL 

Count =1 

Default = none 
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GPSTimeStamp 

Indicates the time as UTC (Coordinated Universal Time). TimeStamp is expressed as three 
RATIONAL values giving the hour, minute, and second. 

Tag = 7(7.H) 

Type = RATIONAL 

Count =3 

Default = none 

GPSSatellites 

Indicates the GPS satellites used for measurements. This tag can be used to describe the number of 
satellites, their ID number, angle of elevation, azimuth, SNR and other information in ASCII notation. 
The format is not specified. If the GPS receiver is incapable of taking measurements, value of the tag 
must be set to NULL. 

Tag = 8 (8.H) 

Type = ASCII 

Count = Any 

Default = none 

GPSStatus 

Indicates the status of the GPS receiver when the image is recorded. 'A' means measurement is in
progress, and 'V' means the measurement is Interoperability.
Tag = 9 (9.H)

Type = ASCII
Count = 2
Default = none

'A' = Measurement in progress

'V' = Measurement Interoperability

Other = reserved 



GPSMeasureMode 

Indicates the GPS measurement mode. '2' means two-dimensional measurement and '3' means
three-dimensional measurement is in progress.

Tag = 10 (A.H)

Type = ASCII

Count = 2

Default = none

'2' = 2-dimensional measurement
'3' = 3-dimensional measurement
Other = reserved



GPSDOP 

Indicates the GPS DOP (data degree of precision). An HDOP value is written during two-dimensional 
measurement, and PDOP during three-dimensional measurement. 

Tag = 11 (B.H)

Type = RATIONAL 

Count =1 

Default = none 



GPSSpeedRef 

Indicates the unit used to express the GPS receiver speed of movement. 'K', 'M' and 'N' represent
kilometers per hour, miles per hour, and knots.

Tag = 12 (C.H)
Type = ASCII
Count = 2
Default = 'K'

'K' = Kilometers per hour
'M' = Miles per hour
'N' = Knots
Other = reserved



GPSSpeed 

Indicates the speed of GPS receiver movement. 
Tag = 13(D.H) 

Type = RATIONAL 
Count =1 
Default = none 



GPSTrackRef 

Indicates the reference for giving the direction of GPS receiver movement. 'T' denotes true direction
and 'M' is magnetic direction.

Tag = 14 (E.H)

Type = ASCII

Count = 2

Default = 'T'

'T' = True direction
'M' = Magnetic direction
Other = reserved



GPSTrack 

Indicates the direction of GPS receiver movement. The range of values is from 0.00 to 359.99. 
Tag = 15(F.H) 

Type = RATIONAL 

Count =1 
Default = none 

GPSImgDirectionRef 

Indicates the reference for giving the direction of the image when it is captured. 'T' denotes true
direction and 'M' is magnetic direction.

Tag = 16 (10.H)
Type = ASCII
Count = 2
Default = 'T'

'T' = True direction
'M' = Magnetic direction
Other = reserved



GPSImgDirection 

Indicates the direction of the image when it was captured. The range of values is from 0.00 to 359.99. 
Tag = 17(11.H) 

Type = RATIONAL 

Count =1 
Default = none 



GPSMapDatum 

Indicates the geodetic survey data used by the GPS receiver. If the survey data is restricted to Japan,
the value of this tag is 'TOKYO' or 'WGS-84'. If a GPS Info tag is recorded, it is strongly recommended
that this tag be recorded.

Tag = 18 (12.H)

Type = ASCII 

Count = Any 

Default = none 
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GPSDestLatitudeRef 

Indicates whether the latitude of the destination point is north or south latitude. The ASCII value 'N' 
indicates north latitude, and 'S' is south latitude. 
Tag = 19(13.H) 

Type = ASCII 
Count = 2 
Default = none 

'N' = North latitude

'S' = South latitude

Other = reserved 

GPSDestLatitude 

Indicates the latitude of the destination point. The latitude is expressed as three RATIONAL values 
giving the degrees, minutes, and seconds, respectively. When degrees, minutes and seconds are 
expressed, the format is dd/1,mm/1,ss/1. When degrees and minutes are used and, for example,
fractions of minutes are given up to two decimal places, the format is dd/1,mmmm/100,0/1.

Tag = 20(14.H) 

Type = RATIONAL 

Count = 3 

Default = none 

GPSDestLongitudeRef 

Indicates whether the longitude of the destination point is east or west longitude. ASCII 'E' indicates
east longitude, and 'W' is west longitude.
Tag = 21 (15.H)

Type = ASCII 
Count =2 
Default = none 

'E' = East longitude 

'W' = West longitude

Other = reserved 

GPSDestLongitude 

Indicates the longitude of the destination point. The longitude is expressed as three RATIONAL values 
giving the degrees, minutes, and seconds, respectively. When degrees, minutes and seconds are 
expressed, the format is ddd/1,mm/1,ss/1. When degrees and minutes are used and, for example,
fractions of minutes are given up to two decimal places, the format is ddd/1,mmmm/100,0/1.

Tag = 22 (16.H)

Type = RATIONAL 

Count = 3 

Default = none 



GPSDestBearingRef 

Indicates the reference used for giving the bearing to the destination point. 'T' denotes true direction
and 'M' is magnetic direction.
Tag = 23 (17.H)

Type = ASCII

Count = 2
Default = 'T'

'T' = True direction

'M' = Magnetic direction

Other = reserved 

GPSDestBearing 

Indicates the bearing to the destination point. The range of values is from 0.00 to 359.99. 
Tag = 24(18.H) 

Type = RATIONAL 
Count = 1 
Default = none 

GPSDestDistanceRef 

Indicates the unit used to express the distance to the destination point. 'K', 'M' and 'N' represent
kilometers, miles and knots.

Tag = 25 (19.H)
Type = ASCII
Count = 2
Default = 'K'

'K' = Kilometers
'M' = Miles
'N' = Knots
Other = reserved



GPSDestDistance 

Indicates the distance to the destination point. 
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Tag = 26(1A.H) 

Type = RATIONAL 

Count = 1 

Default = none 

Note: When the tag Type is ASCII, it must be terminated with NULL. 

It must be noted carefully that since the value count includes the terminator NULL, the total count is 
the number of data+1. For example, GPSLatitudeRef cannot have any values other than Type ASCII 
'N' or 'S'; but because the terminator NULL is added, the value of N is 2. 
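
The count rule in this note can be expressed as a small sketch. The function name is hypothetical; it only shows that the count of an ASCII field is the number of characters plus one for the terminating NULL.

```python
def encode_ascii_field(text: str) -> tuple[bytes, int]:
    """Return (NULL-terminated value bytes, count) for an ASCII type field."""
    data = text.encode("ascii") + b"\x00"  # terminator is part of the value
    return data, len(data)


print(encode_ascii_field("N"))             # GPSLatitudeRef
print(encode_ascii_field("SND00001.WAV"))  # an 8 + '.' + 3 audio file name
```

The same rule explains the fixed counts elsewhere in this section: a single reference character gives a count of 2, and an 8 + '.' + 3 file name gives a count of 13.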
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2.6.7. Interoperability IFD Attribute Information 

The attached information (field name, code) stored in the Interoperability IFD is listed in Table 13,
followed by an explanation of its meaning.



Table 13 Interoperability IFD Attribute Information

Tag Name                           Field Name              Tag ID       Type    Count
                                                           Dec   Hex
A. Attached Information Related to Interoperability
Interoperability Identification    InteroperabilityIndex   1     1      ASCII   Any



A. Tags Relating to Interoperability 

The rules for Exif image files define the description of the following tag. Other tags stored in
the Interoperability IFD may be defined separately by each Interoperability rule.

InteroperabilityIndex

Indicates the identification of the Interoperability rule.

Use "R98" for stating ExifR98 Rules when using the interoperability rules recommended in Appendix D.
Four bytes are used, including the termination code (NULL). See the separate volume of Recommended
Exif Interoperability Rules (ExifR98) for other tags used for ExifR98.

Tag = 1 (1.H) 

Type = ASCII 

Count = Any 

Default = none 
"R98" = Recommended Exif Interoperability Rules (ExifR98)
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2.6.8. Tag Support Levels 

The tags and their support levels are given here. 



A. Primary Image (0th IFD) Support Levels

The support levels of primary image (0th IFD) tags are given in Table 14, Table 15, Table 16 and Table
17.

Table 14 Tag Support Levels (1) - 0th IFD TIFF Tags -



Tag Name 



Field Name 



Tag ID 


Uncompressed 


Compresse 
d 


Dec 


Hex 


Chunky 


Planar 


TOO 


256 


100 


M 


M 


M 


J 


257 


101 


M 


M 


M 


J 


258 


102 


M 


M 


M 


J 


259 


103 


M 


M 


M 


J 


262 


106 


M 


%A 

M 


KA 
IVI 


j 


270 


10E 


R 


R 


R 


R 


271 


10F 


R 


R 


R 


R 


272 


110 


R 


R 


R 


Q 


273 


111 


M 


M 


M 


N 


274 


112 


R 


R 


R 


R 


277 


115 


M 


M 


M 


J 


278 


116 


M 


M 


M 


N 


279 


117 


M 


M 


M 


N 


282 


11A 


M 


M 


M 


M 


283 


11B 


M 


M 


M 


M 


284 


11C 


O 


M 


O 


J 


296 


128 


M 


M 


M 


M 


301 


12D 


R 


R 


R 


R 


305 


131 


O 


O 


O 


O 


306 


132 


R 


R 


R 


R 


315 


13B 


O 


O 


O 


O 


318 


13E 


O 


O 


O 


O 


319 


13F 


O 


O 


O 


o 


513 


201 


N 


N 


N 


N 


514 


202 


N 


N 


N 


N 


529 


211 


N 


N 


O 


O 


530 


212 


N 


N 


M 


J 


531 


213 


N 


N 


M 


M 


532 


214 


O 


O 


O 


O 


33432 


8298 


O 


O 


O 


O 


34665 


8769 


M 


M 


M 


M 


34853 


8825 


O 


O . 


O 


O 



Image width 
Image height 

Number of bits per component 
Compression scheme 
Pixel composition 
Image title 

Manufacturer of image input 
equipment 

Model of image input equipment 

Image data location 

Orientation of image 

Number of components 

Number of rows per strip 

Bytes per compressed strip 

Image resolution in width direction 

Image resolution in height direction 

Image data arrangement 

Unit of X and Y resolution 

Transfer function 

Software used 

File change date and time 

Person who created the image 

White point chromaticity 

Chromaticities of primaries 

Offset to JPEG SOI 

Bytes of JPEG data 

Color space transformation matrix 

coefficients 

Subsampling ratio of Y to C 

Y and C positioning 

Pair of black and white reference 

values 

Copyright holder 
Exif tag 

GPS tag 



ImageWidth 

ImageLength 

BitsPerSample 

Compression 

Photometriclnterpretation 

ImageDescription 

Make 

Model 

StripOffsets 

Orientation 

SamplesPerPixel 

RowsPerStrip 

StripByteCounts 

X Resolution 

Y Resolution 

PlanarConfiguration 

ResolutionUnit 

TransferFunction 

Software 

DateTime 

Artist 

WhitePoint 

PrimaryChromaticities 

J PEG I nterchangeFormat 

JPEGInterchangeFormatLength 

YCbCrCoefficients 

YCbCrSubSampling 
YCbCrPositioning 

ReferenceBlackWhite 

Copyright 

Exit IFD Pointer 

GPSInfo IFD Pointer 



Notation 

M : Mandatory (must be recorded) 

R : Conditionally mandatory (must be recorded if hardware permits) 
O : Optional 
N : Not recorded 

J : Included in JPEG marker and so not recorded 
Table 15 Tag Support Levels (2) - 0th IFD Exif Private Tags -

Tag Name                                        Field Name                   Tag ID       Uncompressed         Compressed
                                                                             Dec    Hex   Chunky  Planar  YCC
Exposure time                                   ExposureTime                 33434  829A  O       O       O    O
F number                                        FNumber                      33437  829D  O       O       O    O
Exposure program                                ExposureProgram              34850  8822  O       O       O    O
Spectral sensitivity                            SpectralSensitivity          34852  8824  O       O       O    O
ISO speed ratings                               ISOSpeedRatings              34855  8827  O       O       O    O
Optoelectric coefficient                        OECF                         34856  8828  O       O       O    O
Exif Version                                    ExifVersion                  36864  9000  M       M       M    M
Date and time original image was generated      DateTimeOriginal             36867  9003  O       O       O    O
Date and time image was made digital data       DateTimeDigitized            36868  9004  O       O       O    O
Meaning of each component                       ComponentsConfiguration      37121  9101  N       N       N    M
Image compression mode                          CompressedBitsPerPixel       37122  9102  N       N       N    O
Shutter speed                                   ShutterSpeedValue            37377  9201  O       O       O    O
Aperture                                        ApertureValue                37378  9202  O       O       O    O
Brightness                                      BrightnessValue              37379  9203  O       O       O    O
Exposure bias                                   ExposureBiasValue            37380  9204  O       O       O    O
Maximum lens aperture                           MaxApertureValue             37381  9205  O       O       O    O
Subject distance                                SubjectDistance              37382  9206  O       O       O    O
Metering mode                                   MeteringMode                 37383  9207  O       O       O    O
Light source                                    LightSource                  37384  9208  O       O       O    O
Flash                                           Flash                        37385  9209  O       O       O    O
Lens focal length                               FocalLength                  37386  920A  O       O       O    O
Manufacturer notes                              MakerNote                    37500  927C  O       O       O    O
User comments                                   UserComment                  37510  9286  O       O       O    O
DateTime subseconds                             SubSecTime                   37520  9290  O       O       O    O
DateTimeOriginal subseconds                     SubSecTimeOriginal           37521  9291  O       O       O    O
DateTimeDigitized subseconds                    SubSecTimeDigitized          37522  9292  O       O       O    O
Supported FlashPix version                      FlashPixVersion              40960  A000  M       M       M    M
Color space information                         ColorSpace                   40961  A001  M       M       M    M
Valid image width                               PixelXDimension              40962  A002  N       N       N    M
Valid image height                              PixelYDimension              40963  A003  N       N       N    M
Related audio file                              RelatedSoundFile             40964  A004  O       O       O    O
Interoperability tag                            Interoperability IFD Pointer 40965  A005  N       N       N    O
Flash energy                                    FlashEnergy                  41483  A20B  O       O       O    O
Spatial frequency response                      SpatialFrequencyResponse     41484  A20C  O       O       O    O
Focal plane X resolution                        FocalPlaneXResolution        41486  A20E  O       O       O    O
Focal plane Y resolution                        FocalPlaneYResolution        41487  A20F  O       O       O    O
Focal plane resolution unit                     FocalPlaneResolutionUnit     41488  A210  O       O       O    O
Subject location                                SubjectLocation              41492  A214  O       O       O    O
Exposure index                                  ExposureIndex                41493  A215  O       O       O    O
Sensing method                                  SensingMethod                41495  A217  O       O       O    O
File source                                     FileSource                   41728  A300  O       O       O    O
Scene type                                      SceneType                    41729  A301  O       O       O    O
CFA pattern                                     CFAPattern                   41730  A302  O       O       O    O

Notation

M : Mandatory (must be recorded)
R : Conditionally mandatory (must be recorded if hardware permits)
O : Optional
N : Not recorded
J : Included in JPEG marker and so not recorded
Table 16 Tag Support Levels (3) - 0th IFD GPS Info Tags -

Tag Name                                        Field Name                   Tag ID       Uncompressed         Compressed
                                                                             Dec    Hex   Chunky  Planar  YCC
GPS tag version                                 GPSVersionID                 0      0     O       O       O    O
North or South Latitude                         GPSLatitudeRef               1      1     O       O       O    O
Latitude                                        GPSLatitude                  2      2     O       O       O    O
East or West Longitude                          GPSLongitudeRef              3      3     O       O       O    O
Longitude                                       GPSLongitude                 4      4     O       O       O    O
Altitude reference                              GPSAltitudeRef               5      5     O       O       O    O
Altitude                                        GPSAltitude                  6      6     O       O       O    O
GPS time (atomic clock)                         GPSTimeStamp                 7      7     O       O       O    O
GPS satellites used for measurement             GPSSatellites                8      8     O       O       O    O
GPS receiver status                             GPSStatus                    9      9     O       O       O    O
GPS measurement mode                            GPSMeasureMode               10     A     O       O       O    O
Measurement precision                           GPSDOP                       11     B     O       O       O    O
Speed unit                                      GPSSpeedRef                  12     C     O       O       O    O
Speed of GPS receiver                           GPSSpeed                     13     D     O       O       O    O
Reference for direction of movement             GPSTrackRef                  14     E     O       O       O    O
Direction of movement                           GPSTrack                     15     F     O       O       O    O
Reference for direction of image                GPSImgDirectionRef           16     10    O       O       O    O
Direction of image                              GPSImgDirection              17     11    O       O       O    O
Geodetic survey data used                       GPSMapDatum                  18     12    O       O       O    O
Reference for latitude of destination           GPSDestLatitudeRef           19     13    O       O       O    O
Latitude of destination                         GPSDestLatitude              20     14    O       O       O    O
Reference for longitude of destination          GPSDestLongitudeRef          21     15    O       O       O    O
Longitude of destination                        GPSDestLongitude             22     16    O       O       O    O
Reference for bearing of destination            GPSDestBearingRef            23     17    O       O       O    O
Bearing of destination                          GPSDestBearing               24     18    O       O       O    O
Reference for distance to destination           GPSDestDistanceRef           25     19    O       O       O    O
Distance to destination                         GPSDestDistance              26     1A    O       O       O    O


Table 17 Tag Support Levels (4) - 0th IFD Interoperability Tag -

Tag Name                                        Field Name                   Tag ID       Uncompressed         Compressed
                                                                             Dec    Hex   Chunky  Planar  YCC
Interoperability Identification                 InteroperabilityIndex        1      1     N       N       N    O

Notation

M : Mandatory (must be recorded)
R : Conditionally mandatory (must be recorded if hardware permits)
O : Optional
N : Not recorded
J : Included in JPEG marker and so not recorded
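
One way to use these support levels is to check a writer's output against the Mandatory entries. The sketch below is illustrative only: the dictionary copies the Compressed column for a handful of rows from Table 14, and the names COMPRESSED_LEVELS and missing_mandatory are hypothetical.

```python
# Compressed-column support levels for a few 0th IFD TIFF tags (subset of Table 14).
COMPRESSED_LEVELS = {
    "ImageWidth": "J",        # carried in the JPEG marker, so not recorded
    "ImageDescription": "R",
    "StripOffsets": "N",
    "XResolution": "M",
    "YResolution": "M",
    "ResolutionUnit": "M",
    "Software": "O",
    "YCbCrPositioning": "M",
    "Exif IFD Pointer": "M",
}


def missing_mandatory(recorded: set[str], levels: dict[str, str]) -> list[str]:
    """Return the tags marked 'M' that are absent from the recorded set."""
    return sorted(tag for tag, level in levels.items()
                  if level == "M" and tag not in recorded)


print(missing_mandatory({"XResolution", "YResolution"}, COMPRESSED_LEVELS))
```

A fuller check would load every row of Tables 14 to 17 and pick the column that matches the image format being written.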
B. Thumbnail (1st IFD) Support Levels

The support levels of thumbnail (1st IFD) tags are shown in Table 18.

Table 18 Tag Support Levels (5) - 1st IFD TIFF Tag -

Tag Name                                        Field Name                   Tag ID       Uncompressed         Compressed
                                                                             Dec    Hex   Chunky  Planar  YCC
Image width                                     ImageWidth                   256    100   M       M       M    J
Image height                                    ImageLength                  257    101   M       M       M    J
Number of bits per component                    BitsPerSample                258    102   M       M       M    J
Compression scheme                              Compression                  259    103   M       M       M    M
Pixel composition                               PhotometricInterpretation    262    106   M       M       M    J
Image title                                     ImageDescription             270    10E   O       O       O    O
Manufacturer of image input equipment           Make                         271    10F   O       O       O    O
Model of image input equipment                  Model                        272    110   O       O       O    O
Image data location                             StripOffsets                 273    111   M       M       M    N
Orientation of image                            Orientation                  274    112   O       O       O    O
Number of components                            SamplesPerPixel              277    115   M       M       M    J
Number of rows per strip                        RowsPerStrip                 278    116   M       M       M    N
Bytes per compressed strip                      StripByteCounts              279    117   M       M       M    N
Image resolution in width direction             XResolution                  282    11A   M       M       M    M
Image resolution in height direction            YResolution                  283    11B   M       M       M    M
Image data arrangement                          PlanarConfiguration          284    11C   O       M       O    J
Unit of X and Y resolution                      ResolutionUnit               296    128   M       M       M    M
Transfer function                               TransferFunction             301    12D   O       O       O    O
Software used                                   Software                     305    131   O       O       O    O
File change date and time                       DateTime                     306    132   O       O       O    O
Person who created the image                    Artist                       315    13B   O       O       O    O
White point chromaticity                        WhitePoint                   318    13E   O       O       O    O
Chromaticities of primaries                     PrimaryChromaticities        319    13F   O       O       O    O
Offset to JPEG SOI                              JPEGInterchangeFormat        513    201   N       N       N    M
Bytes of JPEG data                              JPEGInterchangeFormatLength  514    202   N       N       N    M
Color space transformation matrix coefficients  YCbCrCoefficients            529    211   N       N       O    O
Subsampling ratio of Y to C                     YCbCrSubSampling             530    212   N       N       M    J
Y and C positioning                             YCbCrPositioning             531    213   N       N       O    O
Pair of black and white reference values        ReferenceBlackWhite          532    214   O       O       O    O
Copyright holder                                Copyright                    33432  8298  O       O       O    O
Exif tag                                        Exif IFD Pointer             34665  8769  O       O       O    O
GPS tag                                         GPSInfo IFD Pointer          34853  8825  O       O       O    O

Notation

M : Mandatory (must be recorded)
R : Conditionally mandatory (must be recorded if hardware permits)
O : Optional
N : Not recorded
J : Included in JPEG marker and so not recorded
1. Introduction

1.1 Purpose

The Hypertext Transfer Protocol (HTTP) is an application-level
protocol with the lightness and speed necessary for distributed,
collaborative, hypermedia information systems. HTTP has been in use
by the World-Wide Web global information initiative since 1990. This
specification reflects common usage of the protocol referred to as
"HTTP/1.0". This specification describes the features that seem to be
consistently implemented in most HTTP/1.0 clients and servers. The
specification is split into two sections. Those features of HTTP for 
which implementations are usually consistent are described in the 
main body of this document. Those features which have few or 
inconsistent implementations are listed in Appendix D. 

Practical information systems require more functionality than simple 
retrieval, including search, front-end update, and annotation. HTTP 
allows an open-ended set of methods to be used to indicate the 
purpose of a request. It builds on the discipline of reference 
provided by the Uniform Resource Identifier (URI) [2], as a location 
(URL) [4] or name (URN) [16], for indicating the resource on which a 
method is to be applied. Messages are passed in a format similar to 
that used by Internet Mail [7] and the Multipurpose Internet Mail 
Extensions (MIME) [5]. 

HTTP is also used as a generic protocol for communication between 
user agents and proxies/gateways to other Internet protocols, such as 
SMTP [12], NNTP [11], FTP [14], Gopher [1], and WAIS [8], allowing
basic hypermedia access to resources available from diverse 
applications and simplifying the implementation of user agents. 

1.2 Terminology 

This specification uses a number of terms to refer to the roles 
played by participants in, and objects of, the HTTP communication. 

connection 

A transport layer virtual circuit established between two 
application programs for the purpose of communication. 

message 

The basic unit of HTTP communication, consisting of a structured 
sequence of octets matching the syntax defined in Section 4 and 
transmitted via the connection. 
request

An HTTP request message (as defined in Section 5).

response

An HTTP response message (as defined in Section 6).

resource

A network data object or service which can be identified by a
URI (Section 3.2).

entity

A particular representation or rendition of a data resource, or
reply from a service resource, that may be enclosed within a
request or response message. An entity consists of
metainformation in the form of entity headers and content in the
form of an entity body.

client

An application program that establishes connections for the 
purpose of sending requests. 

user agent 

The client which initiates a request. These are often browsers, 
editors, spiders (web-traversing robots), or other end user
tools. 

server 

An application program that accepts connections in order to 
service requests by sending back responses. 

origin server 

The server on which a given resource resides or is to be created.

proxy

An intermediary program which acts as both a server and a client 
for the purpose of making requests on behalf of other clients. 
Requests are serviced internally or by passing them, with 
possible translation, on to other servers. A proxy must 
interpret and, if necessary, rewrite a request message before 
forwarding it. Proxies are often used as client-side portals
through network firewalls and as helper applications for 
handling requests via protocols not implemented by the user 
agent. 

gateway 

A server which acts as an intermediary for some other server. 
Unlike a proxy, a gateway receives requests as if it were the 
origin server for the requested resource; the requesting client 
may not be aware that it is communicating with a gateway. 
Gateways are often used as server-side portals through network 
firewalls and as protocol translators for access to resources 
stored on non-HTTP systems. 

tunnel 

A tunnel is an intermediary program which is acting as a blind 
relay between two connections. Once active, a tunnel is not 
considered a party to the HTTP communication, though the tunnel 
may have been initiated by an HTTP request. The tunnel ceases to 
exist when both ends of the relayed connections are closed. 
Tunnels are used when a portal is necessary and the intermediary 
cannot, or should not, interpret the relayed communication. 

cache 

A program's local store of response messages and the subsystem 
that controls its message storage, retrieval, and deletion. A 
cache stores cachable responses in order to reduce the response 
time and network bandwidth consumption on future, equivalent 
requests. Any client or server may include a cache, though a 
cache cannot be used by a server while it is acting as a tunnel. 

Any given program may be capable of being both a client and a server; 
our use of these terms refers only to the role being performed by the 
program for a particular connection, rather than to the program's 
capabilities in general. Likewise, any server may act as an origin 
server, proxy, gateway, or tunnel, switching behavior based on the 
nature of each request. 

1.3 Overall Operation 

The HTTP protocol is based on a request/response paradigm. A client 
establishes a connection with a server and sends a request to the 
server in the form of a request method, URI, and protocol version, 
followed by a MIME-like message containing request modifiers, client 
information, and possible body content. The server responds with a 
status line, including the message's protocol version and a success
or error code, followed by a MIME-like message containing server
information, entity metainformation, and possible body content.
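
The request/response exchange described above can be sketched as follows. This is a minimal illustration, not a complete client: the function names and the "demo/0.1" User-Agent value are hypothetical, and the status-line layout (version, code, reason phrase separated by single spaces) is assumed from the Status-Line definition referenced in Section 6.

```python
def build_request(method: str, uri: str, headers: dict[str, str],
                  body: bytes = b"") -> bytes:
    """Assemble a request: request line, header lines, blank line, optional body."""
    lines = [f"{method} {uri} HTTP/1.0"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    head = "\r\n".join(lines) + "\r\n\r\n"  # CRLF line ends, empty line ends headers
    return head.encode("ascii") + body


def parse_status_line(line: str) -> tuple[str, int, str]:
    """Split a status line into (protocol version, status code, reason phrase)."""
    parts = line.rstrip("\r\n").split(" ", 2)
    reason = parts[2] if len(parts) > 2 else ""
    return parts[0], int(parts[1]), reason


print(build_request("GET", "/index.html", {"User-Agent": "demo/0.1"}))
print(parse_status_line("HTTP/1.0 200 OK\r\n"))
```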

Most HTTP communication is initiated by a user agent and consists of 
a request to be applied to a resource on some origin server. In the 
simplest case, this may be accomplished via a single connection (v)
between the user agent (UA) and the origin server (O).

   request chain ------------------------>
UA -------------------v------------------- O
   <----------------------- response chain

A more complicated situation occurs when one or more intermediaries 
are present in the request/response chain. There are three common 
forms of intermediary: proxy, gateway, and tunnel. A proxy is a 
forwarding agent, receiving requests for a URI in its absolute form, 
rewriting all or parts of the message, and forwarding the reformatted 
request toward the server identified by the URI. A gateway is a 
receiving agent, acting as a layer above some other server(s) and, if 
necessary, translating the requests to the underlying server's 
protocol. A tunnel acts as a relay point between two connections 
without changing the messages; tunnels are used when the 
communication needs to pass through an intermediary (such as a 
firewall) even when the intermediary cannot understand the contents 
of the messages. 

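   request chain -------------------------------------->
UA -----v----- A -----v----- B -----v----- C -----v----- O
   <------------------------------------- response chain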

The figure above shows three intermediaries (A, B, and C) between the 
user agent and origin server. A request or response message that 
travels the whole chain must pass through four separate connections. 
This distinction is important because some HTTP communication options 
may apply only to the connection with the nearest, non-tunnel 
neighbor, only to the end-points of the chain, or to all connections 
along the chain. Although the diagram is linear, each participant may 
be engaged in multiple, simultaneous communications. For example, B 
may be receiving requests from many clients other than A, and/or 
forwarding requests to servers other than C, at the same time that it 
is handling A's request. 

Any party to the communication which is not acting as a tunnel may 
employ an internal cache for handling requests. The effect of a cache 
is that the request/response chain is shortened if one of the 
participants along the chain has a cached response applicable to that 
request. The following illustrates the resulting chain if B has a 
cached copy of an earlier response from O (via C) for a request which
has not been cached by UA or A.

   request chain ---------->
UA -----v----- A -----v----- B - - - - - - C - - - - - - O
   <--------- response chain
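
The shortening effect of a cache can be modelled with a toy sketch. The function name connections_used and the list representation of the chain are assumptions made for illustration; real intermediaries decide per request whether a cached response applies.

```python
from typing import Optional


def connections_used(chain: list[str], cached_at: Optional[str] = None) -> int:
    """Count the connections a request crosses before some participant answers it.

    With no cache hit the request travels to the last participant (the origin
    server); otherwise it stops at the participant holding the cached response.
    """
    if cached_at is None:
        return len(chain) - 1
    return chain.index(cached_at)


chain = ["UA", "A", "B", "C", "O"]
print(connections_used(chain))       # whole chain, no cache hit
print(connections_used(chain, "B"))  # B answers from its cache
```

With three intermediaries the full chain uses four connections, while a hit at B uses only the two connections between UA and B.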

Not all responses are cachable, and some requests may contain 
modifiers which place special requirements on cache behavior. Some 
HTTP/1.0 applications use heuristics to describe what is or is not a
"cachable" response, but these rules are not standardized. 

On the Internet, HTTP communication generally takes place over TCP/IP 
connections. The default port is TCP 80 [15], but other ports can be 
used. This does not preclude HTTP from being implemented on top of 
any other protocol on the Internet, or on other networks. HTTP only 
presumes a reliable transport; any protocol that provides such 
guarantees can be used, and the mapping of the HTTP/1.0 request and 
response structures onto the transport data units of the protocol in 
question is outside the scope of this specification. 

Except for experimental applications, current practice requires that 
the connection be established by the client prior to each request and 
closed by the server after sending the response. Both clients and 
servers should be aware that either party may close the connection 
prematurely, due to user action, automated time-out, or program 
failure, and should handle such closing in a predictable fashion. In 
any case, the closing of the connection by either or both parties 
always terminates the current request, regardless of its status. 

1.4 HTTP and MIME 

HTTP/1.0 uses many of the constructs defined for MIME, as defined in 
RFC 1521 [5]. Appendix C describes the ways in which the context of 
HTTP allows for different use of Internet Media Types than is 
typically found in Internet mail, and gives the rationale for those 
differences. 

2. Notational Conventions and Generic Grammar 

2.1 Augmented BNF

All of the mechanisms specified in this document are described in 
both prose and an augmented Backus-Naur Form (BNF) similar to that 
used by RFC 822 [7]. Implementors will need to be familiar with the 
notation in order to understand this specification. The augmented BNF 
includes the following constructs: 
name = definition

The name of a rule is simply the name itself (without any 
enclosing "<" and ">") and is separated from its definition by 
the equal character "=". Whitespace is only significant in that 
indentation of continuation lines is used to indicate a rule 
definition that spans more than one line. Certain basic rules 
are in uppercase, such as SP, LWS, HT, CRLF, DIGIT, ALPHA, etc. 
Angle brackets are used within definitions whenever their 
presence will facilitate discerning the use of rule names. 

"literal" 

Quotation marks surround literal text. Unless stated otherwise, 
the text is case-insensitive.

rule1 | rule2

Elements separated by a bar ("|") are alternatives,
e.g., "yes | no" will accept yes or no.

(rule1 rule2)

Elements enclosed in parentheses are treated as a single
element. Thus, "(elem (foo | bar) elem)" allows the token
sequences "elem foo elem" and "elem bar elem".

*rule

The character "*" preceding an element indicates repetition. The 
full form is "<n>*<m>element" indicating at least <n> and at 
most <m> occurrences of element. Default values are 0 and 
infinity so that "*(element)" allows any number, including zero;
"1*element" requires at least one; and "1*2element" allows one
or two.

[rule] 

Square brackets enclose optional elements; "[foo bar]" is 
equivalent to "*1(foo bar)".

N rule 

Specific repetition: "<n>(element)" is equivalent to
"<n>*<n>(element)"; that is, exactly <n> occurrences of
(element). Thus 2DIGIT is a 2-digit number, and 3ALPHA is a 
string of three alphabetic characters. 
#rule

A construct "#" is defined, similar to "*", for defining lists 
of elements. The full form is "<n>#<m>element" indicating at 
least <n> and at most <m> elements, each separated by one or 
more commas (",") and optional linear whitespace (LWS). This 
makes the usual form of lists very easy; a rule such as
"( *LWS element *( *LWS "," *LWS element ))" can be shown as
"1#element". Wherever this construct is used, null elements are
allowed, but do not contribute to the count of elements present. 
That is, "(element), , (element)" is permitted, but counts as 
only two elements. Therefore, where at least one element is 
required, at least one non-null element must be present. Default 
values are 0 and infinity so that "#(element)" allows any
number, including zero; "1#element" requires at least one; and
"1#2element" allows one or two.

; comment 

A semi-colon, set off some distance to the right of rule text, 
starts a comment that continues to the end of line. This is a 
simple way of including useful notes in parallel with the 
specifications. 

implied *LWS 

The grammar described by this specification is word-based. 
Except where noted otherwise, linear whitespace (LWS) can be 
included between any two adjacent words (token or 
quoted-string), and between adjacent tokens and delimiters
(tspecials), without changing the interpretation of a field. At 
least one delimiter (tspecials) must exist between any two 
tokens, since they would otherwise be interpreted as a single 
token. However, applications should attempt to follow "common 
form" when generating HTTP constructs, since there exist some 
implementations that fail to accept anything beyond the common 
forms. 
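
The "#" list construct described above, including its treatment of null elements, can be sketched as a small parser. This is a simplified illustration: the function name parse_list is hypothetical, and the sketch does not handle commas that appear inside a quoted-string.

```python
def parse_list(value: str) -> list[str]:
    """Split a '#rule' list on commas, trimming LWS and dropping null elements."""
    items = (item.strip(" \t\r\n") for item in value.split(","))
    return [item for item in items if item]


print(parse_list("foo, , bar"))
```

Because null elements do not count, a caller enforcing "1#element" would check that the returned list is non-empty rather than counting commas.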

2.2 Basic Rules 

The following rules are used throughout this specification to 
describe basic parsing constructs. The US-ASCII coded character set 
is defined by [17]. 



OCTET   = <any 8-bit sequence of data>
CHAR    = <any US-ASCII character (octets 0 - 127)>
UPALPHA = <any US-ASCII uppercase letter "A".."Z">
LOALPHA = <any US-ASCII lowercase letter "a".."z">
ALPHA   = UPALPHA | LOALPHA
DIGIT   = <any US-ASCII digit "0".."9">
CTL     = <any US-ASCII control character
          (octets 0 - 31) and DEL (127)>
CR      = <US-ASCII CR, carriage return (13)>
LF      = <US-ASCII LF, linefeed (10)>
SP      = <US-ASCII SP, space (32)>
HT      = <US-ASCII HT, horizontal-tab (9)>
<">     = <US-ASCII double-quote mark (34)>



HTTP/1.0 defines the octet sequence CR LF as the end-of-line marker 
for all protocol elements except the Entity-Body (see Appendix B for 
tolerant applications). The end-of-line marker within an Entity-Body 
is defined by its associated media type, as described in Section 3.6. 

CRLF = CR LF 

HTTP/1.0 headers may be folded onto multiple lines if each 
continuation line begins with a space or horizontal tab. All linear 
whitespace, including folding, has the same semantics as SP. 

LWS = [CRLF] 1*( SP | HT )

However, folding of header lines is not expected by some 
applications, and should not be generated by HTTP/1.0 applications. 
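
Since all linear whitespace, including folding, has the same semantics as SP, a recipient can normalise a folded field value before parsing it. The sketch below is one possible approach, with a hypothetical function name; it also collapses whitespace inside quoted strings, which a stricter parser may want to preserve.

```python
import re

# LWS = [CRLF] 1*( SP | HT )
_LWS = re.compile(r"(?:\r\n)?[ \t]+")


def unfold(field_value: str) -> str:
    """Replace each run of linear whitespace, including folds, with a single SP."""
    return _LWS.sub(" ", field_value)


print(unfold("text/html,\r\n\ttext/plain"))
```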

The TEXT rule is only used for descriptive field contents and values 
that are not intended to be interpreted by the message parser. Words 
of TEXT may contain octets from character sets other than US-ASCII. 

TEXT = <any OCTET except CTLs,
       but including LWS>

Recipients of header field TEXT containing octets outside the US- 
ASCII character set may assume that they represent ISO-8859-1 
characters. 

Hexadecimal numeric characters are used in several protocol elements. 

HEX = "A" | "B" | "C" | "D" | "E" | "F"
    | "a" | "b" | "c" | "d" | "e" | "f" | DIGIT

Many HTTP/1.0 header field values consist of words separated by LWS
or special characters. These special characters must be in a quoted 
string to be used within a parameter value. 

word      = token | quoted-string

token     = 1*<any CHAR except CTLs or tspecials>

tspecials = "(" | ")" | "<" | ">" | "@"
          | "," | ";" | ":" | "\" | <">
          | "/" | "[" | "]" | "?" | "="
          | "{" | "}" | SP | HT

Comments may be included in some HTTP header fields by surrounding 
the comment text with parentheses. Comments are only allowed in 
fields containing "comment" as part of their field value definition.
In all other fields, parentheses are considered part of the field 
value. 

comment = "(" *( ctext | comment ) ")"
ctext   = <any TEXT excluding "(" and ")">

A string of text is parsed as a single word if it is quoted using 
double-quote marks. 

quoted-string = ( <"> *(qdtext) <"> ) 

qdtext = <any CHAR except <"> and CTLs,
         but including LWS>

Single-character quoting using the backslash ("\") character is not
permitted in HTTP/1.0. 
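
The token rule can be checked mechanically. The sketch below is a minimal illustration with hypothetical names; the delimiter set is written out from the tspecials rule (parentheses, angle brackets, "@", comma, semicolon, colon, backslash, double quote, slash, square brackets, "?", "=", braces, SP and HT).

```python
# Delimiters from the tspecials rule, including SP and HT.
TSPECIALS = set('()<>@,;:\\"/[]?={} \t')


def is_token(s: str) -> bool:
    """True if s is one or more CHARs that are neither CTLs nor tspecials."""
    # CHAR is octets 0-127; CTLs are 0-31 and 127; SP (32) is a tspecial.
    return bool(s) and all(32 < ord(ch) < 127 and ch not in TSPECIALS for ch in s)


print(is_token("text"), is_token("text/html"))
```

A value that fails this check, such as one containing "/" or a space, would have to be carried as a quoted-string where the grammar allows a word.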

3. Protocol Parameters 

3.1 HTTP Version

HTTP uses a "<major>.<minor>" numbering scheme to indicate versions
of the protocol. The protocol versioning policy is intended to allow 
the sender to indicate the format of a message and its capacity for 
understanding further HTTP communication, rather than the features 
obtained via that communication. No change is made to the version 
number for the addition of message components which do not affect 
communication behavior or which only add to extensible field values. 
The <minor> number is incremented when the changes made to the 
protocol add features which do not change the general message parsing 
algorithm, but which may add to the message semantics and imply 
additional capabilities of the sender. The <major> number is 
incremented when the format of a message within the protocol is 
changed. 

The version of an HTTP message is indicated by an HTTP-Version field 
in the first line of the message. If the protocol version is not 
specified, the recipient must assume that the message is in the 
simple HTTP/0.9 format.

HTTP-Version = "HTTP" "/" 1*DIGIT "." 1*DIGIT 

Note that the major and minor numbers should be treated as separate 
integers and that each may be incremented higher than a single digit. 
Thus, HTTP/2.4 is a lower version than HTTP/2.13, which in turn is 
lower than HTTP/12.3. Leading zeros should be ignored by recipients 
and never generated by senders. 
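
The ordering rule above (major and minor compared as separate integers, leading zeros ignored) can be sketched with a small parser. The function name parse_http_version is hypothetical.

```python
import re

# HTTP-Version = "HTTP" "/" 1*DIGIT "." 1*DIGIT
_VERSION = re.compile(r"HTTP/([0-9]+)\.([0-9]+)\Z")


def parse_http_version(text: str) -> tuple[int, int]:
    """Return (major, minor) as integers so that versions compare correctly."""
    match = _VERSION.match(text)
    if match is None:
        raise ValueError(f"not an HTTP-Version: {text!r}")
    return int(match.group(1)), int(match.group(2))  # int() drops leading zeros


versions = ["HTTP/12.3", "HTTP/2.13", "HTTP/2.4"]
print(sorted(versions, key=parse_http_version))
```

Comparing the tuples rather than the strings is what keeps HTTP/2.4 below HTTP/2.13; a plain string or decimal comparison would order them the other way round.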

This document defines both the 0.9 and 1.0 versions of the HTTP 
protocol. Applications sending Full-Request or Full-Response 
messages, as defined by this specification, must include an
HTTP-Version of "HTTP/1.0".

HTTP/1.0 servers must: 

o recognize the format of the Request-Line for HTTP/0.9 and 
HTTP/1.0 requests; 

o understand any valid request in the format of HTTP/0.9 or 
HTTP/1.0; 

o respond appropriately with a message in the same protocol 
version used by the client. 

HTTP/1.0 clients must:

o recognize the format of the Status-Line for HTTP/1.0 responses; 

o understand any valid response in the format of HTTP/0.9 or 
HTTP/1.0.

Proxy and gateway applications must be careful in forwarding requests 
that are received in a format different than that of the 
application's native HTTP version. Since the protocol version 
indicates the protocol capability of the sender, a proxy/gateway must 
never send a message with a version indicator which is greater than 
its native version; if a higher version request is received, the 
proxy/gateway must either downgrade the request version or respond 
with an error. Requests with a version lower than that of the 
application's native format may be upgraded before being forwarded; 
the proxy/gateway's response to that request must follow the server 
requirements listed above. 
3.2 Uniform Resource Identifiers



URIs have been known by many names: WWW addresses, Universal Document 
Identifiers, Universal Resource Identifiers [2], and finally the 
combination of Uniform Resource Locators (URL) [4] and Names (URN) 
[16]. As far as HTTP is concerned, Uniform Resource Identifiers are 
simply formatted strings which identify -- via name, location, or any
other characteristic -- a network resource.

3.2.1 General Syntax 

URIs in HTTP can be represented in absolute form or relative to some 
known base URI [9], depending upon the context of their use. The two 
forms are differentiated by the fact that absolute URIs always begin 
with a scheme name followed by a colon. 



URI         = ( absoluteURI | relativeURI ) [ "#" fragment ]

absoluteURI = scheme ":" *( uchar | reserved )

relativeURI = net_path | abs_path | rel_path

net_path    = "//" net_loc [ abs_path ]
abs_path    = "/" rel_path
rel_path    = [ path ] [ ";" params ] [ "?" query ]

path        = fsegment *( "/" segment )
fsegment    = 1*pchar
segment     = *pchar

params      = param *( ";" param )
param       = *( pchar | "/" )

scheme      = 1*( ALPHA | DIGIT | "+" | "-" | "." )
net_loc     = *( pchar | ";" | "?" )
query       = *( uchar | reserved )
fragment    = *( uchar | reserved )

pchar       = uchar | ":" | "@" | "&" | "=" | "+"
uchar       = unreserved | escape
unreserved  = ALPHA | DIGIT | safe | extra | national

escape      = "%" HEX HEX
reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+"
extra       = "!" | "*" | "'" | "(" | ")" | ","
safe        = "$" | "-" | "_" | "."
unsafe      = CTL | SP | <"> | "#" | "%" | "<" | ">"
national    = <any OCTET excluding ALPHA, DIGIT,
              reserved, extra, safe, and unsafe>

For definitive information on URL syntax and semantics, see RFC 1738 
[4] and RFC 1808 [9]. The BNF above includes national characters not 
allowed in valid URLs as specified by RFC 1738, since HTTP servers 
are not restricted in the set of unreserved characters allowed to 
represent the rel_path part of addresses, and HTTP proxies may 
receive requests for URIs not defined by RFC 1738. 

3.2.2 http URL 

The "http" scheme is used to locate network resources via the HTTP 
protocol. This section defines the scheme-specific syntax and 
semantics for http URLs. 

http_URL = "http:" "//" host [ ":" port ] [ abs_path ]

host     = <A legal Internet host domain name
           or IP address (in dotted-decimal form),
           as defined by Section 2.1 of RFC 1123>

port     = *DIGIT

If the port is empty or not given, port 80 is assumed. The semantics 
are that the identified resource is located at the server listening 
for TCP connections on that port of that host, and the Request-URI 
for the resource is abs_path. If the abs_path is not present in the 
URL, it must be given as "/" when used as a Request-URI (Section
5.1.2). 

Note: Although the HTTP protocol is independent of the transport 
layer protocol, the http URL only identifies resources by their 
TCP location, and thus non-TCP resources must be identified by 
some other URI scheme. 

The canonical form for "http" URLs is obtained by converting any 
UPALPHA characters in host to their LOALPHA equivalent (hostnames are 
case-insensitive), eliding the [ ":" port ] if the port is 80, and 
replacing an empty abs_path with "/". 
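
As a rough illustration of the canonical form described above, the following Python sketch lowercases the host, drops an empty or default port, and supplies "/" for an empty abs_path. The function name canonical_http_url is hypothetical, and the sketch assumes its input already matches the http URL grammar; it does not validate the host or port.

```python
def canonical_http_url(url):
    """Return the canonical form of an http URL (sketch, no validation)."""
    prefix = "http://"
    if not url.startswith(prefix):
        raise ValueError(f"not an http URL: {url!r}")
    # Split "host[:port]" from the optional abs_path.
    authority, _, rest = url[len(prefix):].partition("/")
    host, _, port = authority.partition(":")
    netloc = host.lower()          # hostnames are case-insensitive
    if port not in ("", "80"):     # an empty port or 80 means the default
        netloc += ":" + port
    # Re-attach the path; an absent abs_path becomes "/".
    return prefix + netloc + "/" + rest
```

Two URLs that canonicalize to the same string under this function identify the same resource, which is one way a cache or proxy might compare them.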

3.3 Date/Time Formats 

HTTP/1.0 applications have historically allowed three different 
formats for the representation of date/time stamps: 

Sun, 06 Nov 1994 08:49:37 GMT    ; RFC 822, updated by RFC 1123
Sunday, 06-Nov-94 08:49:37 GMT   ; RFC 850, obsoleted by RFC 1036
Sun Nov  6 08:49:37 1994         ; ANSI C's asctime() format

The first format is preferred as an Internet standard and represents
a fixed-length subset of that defined by RFC 1123 [6] (an update to
RFC 822 [7]). The second format is in common use, but is based on the
obsolete RFC 850 [10] date format and lacks a four-digit year. 
HTTP/1.0 clients and servers that parse the date value should accept 
all three formats, though they must never generate the third 
(asctime) format. 

Note: Recipients of date values are encouraged to be robust in 
accepting date values that may have been generated by non-HTTP 
applications, as is sometimes the case when retrieving or posting 
messages via proxies/gateways to SMTP or NNTP. 

All HTTP/1.0 date/time stamps must be represented in Universal Time
(UT), also known as Greenwich Mean Time (GMT), without exception. 
This is indicated in the first two formats by the inclusion of "GMT" 
as the three-letter abbreviation for time zone, and should be assumed 
when reading the asctime format. 

HTTP-date    = rfc1123-date | rfc850-date | asctime-date

rfc1123-date = wkday "," SP date1 SP time SP "GMT"
rfc850-date  = weekday "," SP date2 SP time SP "GMT"
asctime-date = wkday SP date3 SP time SP 4DIGIT

date1        = 2DIGIT SP month SP 4DIGIT
               ; day month year (e.g., 02 Jun 1982)
date2        = 2DIGIT "-" month "-" 2DIGIT
               ; day-month-year (e.g., 02-Jun-82)
date3        = month SP ( 2DIGIT | ( SP 1DIGIT ))
               ; month day (e.g., Jun  2)

time         = 2DIGIT ":" 2DIGIT ":" 2DIGIT
               ; 00:00:00 - 23:59:59

wkday        = "Mon" | "Tue" | "Wed"
             | "Thu" | "Fri" | "Sat" | "Sun"

weekday      = "Monday" | "Tuesday" | "Wednesday"
             | "Thursday" | "Friday" | "Saturday" | "Sunday"

month        = "Jan" | "Feb" | "Mar" | "Apr"
             | "May" | "Jun" | "Jul" | "Aug"
             | "Sep" | "Oct" | "Nov" | "Dec"

Note: HTTP requirements for the date/time stamp format apply 
only to their usage within the protocol stream. Clients and 
servers are not required to use these formats for user 
presentation, request logging, etc.
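
A recipient that accepts all three date formats but generates only the first might look like the sketch below. It is an illustration under stated assumptions: the names parse_http_date and format_http_date are hypothetical, and the parsing relies on strptime() running in an English ("C") locale so that day and month names match.

```python
from datetime import datetime, timezone
from email.utils import formatdate

# One strptime() pattern per accepted format, in order of preference.
_DATE_FORMATS = (
    "%a, %d %b %Y %H:%M:%S GMT",   # RFC 822, updated by RFC 1123
    "%A, %d-%b-%y %H:%M:%S GMT",   # RFC 850 (two-digit year)
    "%a %b %d %H:%M:%S %Y",        # ANSI C asctime(), assumed to be GMT
)


def parse_http_date(text):
    """Parse any of the three formats into a timezone-aware UTC datetime."""
    for pattern in _DATE_FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), pattern)
        except ValueError:
            continue  # try the next format
        return parsed.replace(tzinfo=timezone.utc)
    raise ValueError(f"unrecognized HTTP-date: {text!r}")


def format_http_date(moment):
    """Generate only the preferred RFC 1123 form."""
    return formatdate(moment.timestamp(), usegmt=True)
```

Note that the two-digit year of the RFC 850 form is expanded by strptime() using its own pivot rule, which is a property of the library rather than of this specification.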

3.4 Character Sets 

HTTP uses the same definition of the term "character set" as that 
described for MIME: 

The term "character set" is used in this document to refer to a 
method used with one or more tables to convert a sequence of 
octets into a sequence of characters. Note that unconditional 
conversion in the other direction is not required, in that not all 
characters may be available in a given character set and a 
character set may provide more than one sequence of octets to 
represent a particular character. This definition is intended to 
allow various kinds of character encodings, from simple single- 
table mappings such as US-ASCII to complex table switching methods 
such as those that use ISO 2022's techniques. However, the
definition associated with a MIME character set name must fully 
specify the mapping to be performed from octets to characters. In 
particular, use of external profiling information to determine the 
exact mapping is not permitted. 

Note: This use of the term "character set" is more commonly 
referred to as a "character encoding." However, since HTTP and 
MIME share the same registry, it is important that the terminology 
also be shared. 

HTTP character sets are identified by case-insensitive tokens. The
complete set of tokens is defined by the IANA Character Set registry
[15]. However, because that registry does not define a single,
consistent token for each character set, we define here the preferred
names for those character sets most likely to be used with HTTP
entities. These character sets include those registered by RFC 1521
[5] -- the US-ASCII [17] and ISO-8859 [18] character sets -- and
other names specifically recommended for use within MIME charset
parameters.

charset = "US-ASCII"
        | "ISO-8859-1" | "ISO-8859-2" | "ISO-8859-3"
        | "ISO-8859-4" | "ISO-8859-5" | "ISO-8859-6"
        | "ISO-8859-7" | "ISO-8859-8" | "ISO-8859-9"
        | "ISO-2022-JP" | "ISO-2022-JP-2" | "ISO-2022-KR"
        | "UNICODE-1-1" | "UNICODE-1-1-UTF-7" | "UNICODE-1-1-UTF-8"
        | token

Although HTTP allows an arbitrary token to be used as a charset 
value, any token that has a predefined value within the IANA 
Character Set registry [15] must represent the character set defined 
by that registry. Applications should limit their use of character
sets to those defined by the IANA registry. 

The character set of an entity body should be labelled as the lowest 
common denominator of the character codes used within that body, with 
the exception that no label is preferred over the labels US-ASCII or 
ISO-8859-1. 

3.5 Content Codings 

Content coding values are used to indicate an encoding transformation 
that has been applied to a resource. Content codings are primarily 
used to allow a document to be compressed or encrypted without losing 
the identity of its underlying media type. Typically, the resource is 
stored in this encoding and only decoded before rendering or 
analogous usage. 

content-coding = "x-gzip" | "x-compress" | token

Note: For future compatibility, HTTP/1.0 applications should 
consider "gzip" and "compress" to be equivalent to "x-gzip" 
and "x-compress", respectively. 

All content-coding values are case-insensitive. HTTP/1.0 uses
content-coding values in the Content-Encoding (Section 10.3) header
field. Although the value describes the content-coding, what is more
important is that it indicates what decoding mechanism will be 
required to remove the encoding. Note that a single program may be 
capable of decoding multiple content-coding formats. Two values are 
defined by this specification: 

x-gzip 

An encoding format produced by the file compression program 
"gzip" (GNU zip) developed by Jean-loup Gailly. This format is 
typically a Lempel-Ziv coding (LZ77) with a 32 bit CRC. 

x-compress 

The encoding format produced by the file compression program 
"compress". This format is an adaptive Lempel-Ziv-Welch coding 
(LZW). 

Note: Use of program names for the identification of 
encoding formats is not desirable and should be discouraged 
for future encodings. Their use here is representative of 
historical practice, not good design. 
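
The handling of content-coding values can be sketched as follows. This is a minimal illustration, not a complete decoder: the names normalize_content_coding and decode_body are hypothetical, and only "x-gzip" is decoded because the Python standard library has no LZW ("compress") decoder.

```python
import gzip

# "gzip" and "compress" are treated as equivalent to the "x-" forms.
_ALIASES = {"gzip": "x-gzip", "compress": "x-compress"}


def normalize_content_coding(value):
    """Fold case and map the future names onto the HTTP/1.0 names."""
    token = value.strip().lower()   # content-coding values are case-insensitive
    return _ALIASES.get(token, token)


def decode_body(body, content_coding=None):
    """Remove the content coding from `body` (bytes), if one was applied."""
    if content_coding is None:
        return body                 # default coding is the identity function
    coding = normalize_content_coding(content_coding)
    if coding == "x-gzip":
        return gzip.decompress(body)
    raise ValueError(f"unsupported content-coding: {coding}")
```

A caller would typically pass the value of the Content-Encoding header field as the second argument.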

3.6 Media Types

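HTTP uses Internet Media Types [13] in the Content-Type header field
(Section 10.5) in order to provide open and extensible data typing.

media-type = type "/" subtype *( ";" parameter )
type       = token
subtype    = token

Parameters may follow the type/subtype in the form of attribute/value
pairs.

parameter  = attribute "=" value
attribute  = token
value      = token | quoted-string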

The type, subtype, and parameter attribute names are case- 
insensitive. Parameter values may or may not be case-sensitive, 
depending on the semantics of the parameter name. LWS must not be 
generated between the type and subtype, nor between an attribute and 
its value. Upon receipt of a media type with an unrecognized 
parameter, a user agent should treat the media type as if the 
unrecognized parameter and its value were not present. 
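
The case rules above can be illustrated with a small Python parser. It is a sketch under stated assumptions: the name parse_media_type is hypothetical, and splitting on ";" means a quoted parameter value that itself contains ";" is not handled.

```python
def parse_media_type(value):
    """Split 'type/subtype; attribute=value' into (type, subtype, params)."""
    head, *rest = value.split(";")
    main_type, slash, subtype = head.strip().partition("/")
    if not slash or not main_type or not subtype:
        raise ValueError(f"not a media type: {value!r}")
    params = {}
    for item in rest:
        attribute, equals, param_value = item.strip().partition("=")
        if not equals:
            continue  # skip malformed parameters
        param_value = param_value.strip()
        # Strip the surrounding quotes of a quoted-string value.
        if len(param_value) >= 2 and param_value[0] == param_value[-1] == '"':
            param_value = param_value[1:-1]
        # Attribute names are case-insensitive; values are left untouched.
        params[attribute.strip().lower()] = param_value
    # Type and subtype are case-insensitive as well.
    return main_type.lower(), subtype.lower(), params
```

A user agent that does not recognize a parameter can simply ignore the corresponding key in the returned dictionary.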

Some older HTTP applications do not recognize media type parameters. 
HTTP/1.0 applications should only use media type parameters when they
are necessary to define the content of a message. 

Media-type values are registered with the Internet Assigned Number 
Authority (IANA [15]). The media type registration process is 
outlined in RFC 1590 [13]. Use of non-registered media types is 
discouraged. 

3.6.1 Canonicalization and Text Defaults

Internet media types are registered with a canonical form. In 
general, an Entity-Body transferred via HTTP must be represented in 
the appropriate canonical form prior to its transmission. If the body 
has been encoded with a Content-Encoding, the underlying data should
be in canonical form prior to being encoded. 

Media subtypes of the "text" type use CRLF as the text line break 
when in canonical form. However, HTTP allows the transport of text 
media with plain CR or LF alone representing a line break when used 
consistently within the Entity-Body. HTTP applications must accept 
CRLF, bare CR, and bare LF as being representative of a line break in 
text media received via HTTP. 

In addition, if the text media is represented in a character set that
does not use octets 13 and 10 for CR and LF respectively, as is the 
case for some multi-byte character sets, HTTP allows the use of 
whatever octet sequences are defined by that character set to 
represent the equivalent of CR and LF for line breaks. This 
flexibility regarding line breaks applies only to text media in the 
Entity-Body; a bare CR or LF should not be substituted for CRLF 
within any of the HTTP control structures (such as header fields and 
multipart boundaries). 

The "charset" parameter is used with some media types to define the 
character set (Section 3.4) of the data. When no explicit charset 
parameter is provided by the sender, media subtypes of the "text" 
type are defined to have a default charset value of "ISO-8859-1" when
received via HTTP. Data in character sets other than "ISO-8859-1" or
its subsets must be labelled with an appropriate charset value in 
order to be consistently interpreted by the recipient. 

Note: Many current HTTP servers provide data using charsets other 
than "ISO-8859-1" without proper labelling. This situation reduces
interoperability and is not recommended. To compensate for this, 
some HTTP user agents provide a configuration option to allow the 
user to change the default interpretation of the media type 
character set when no charset parameter is given. 
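
The text defaults of this section can be sketched as below. The function name decode_text_body and the `default` argument are hypothetical; the argument stands in for the user-configurable default mentioned in the note, and the sketch assumes a character set in which CR and LF are octets 13 and 10.

```python
def decode_text_body(body, params, default="ISO-8859-1"):
    """Decode a "text" entity body (bytes) into a string.

    `params` is a dict of media type parameters keyed by lowercase name.
    """
    # No explicit charset parameter means ISO-8859-1 for text received via HTTP.
    charset = params.get("charset", default)
    text = body.decode(charset)
    # Accept CRLF, bare CR, and bare LF as line breaks in text media.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

The line break folding applies only to the entity body; header fields and multipart boundaries must still use CRLF.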

3.6.2 Multipart Types 

MIME provides for a number of "multipart" types -- encapsulations of
several entities within a single message's Entity-Body. The multipart
types registered by IANA [15] do not have any special meaning for
HTTP/1.0, though user agents may need to understand each type in
order to correctly interpret the purpose of each body-part. An HTTP 
user agent should follow the same or similar behavior as a MIME user 
agent does upon receipt of a multipart type. HTTP servers should not 
assume that all HTTP clients are prepared to handle multipart types. 

All multipart types share a common syntax and must include a boundary 
parameter as part of the media type value. The message body is itself 
a protocol element and must therefore use only CRLF to represent line 
breaks between body-parts. Multipart body-parts may contain HTTP 
header fields which are significant to the meaning of that part. 

3.7 Product Tokens 

Product tokens are used to allow communicating applications to 
identify themselves via a simple product token, with an optional 
slash and version designator. Most fields using product tokens also 
allow subproducts which form a significant part of the application to 
be listed, separated by whitespace. By convention, the products are
listed in order of their significance for identifying the
application.

product         = token ["/" product-version]
product-version = token

Examples:

User-Agent: CERN-LineMode/2.15 libwww/2.17b3

Server: Apache/0.8.4 

Product tokens should be short and to the point -- use of them for 
advertising or other non-essential information is explicitly
forbidden. Although any token character may appear in a product- 
version, this token should only be used for a version identifier 
(i.e., successive versions of the same product should only differ in 
the product-version portion of the product value). 
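
Splitting a field value into product tokens can be sketched in a few lines of Python. The name parse_products is hypothetical, and the sketch assumes the field value contains only whitespace-separated product tokens (no parenthesized comments).

```python
def parse_products(field_value):
    """Return a list of (product, version) pairs, most significant first."""
    products = []
    for item in field_value.split():
        # product = token ["/" product-version]
        name, _, version = item.partition("/")
        products.append((name, version or None))
    return products
```

For the User-Agent example above, the first pair identifies the application and the second its underlying library.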

4. HTTP Message 

4.1 Message Types

HTTP messages consist of requests from client to server and responses
from server to client.

HTTP-message = Simple-Request       ; HTTP/0.9 messages
             | Simple-Response
             | Full-Request         ; HTTP/1.0 messages
             | Full-Response



Full-Request and Full-Response use the generic message format of RFC 
822 [7] for transferring entities. Both messages may include optional 
header fields (also known as "headers") and an entity body. The 
entity body is separated from the headers by a null line (i.e., a 
line with nothing preceding the CRLF).

Full-Request  = Request-Line             ; Section 5.1
                *( General-Header        ; Section 4.3
                 | Request-Header        ; Section 5.2
                 | Entity-Header )       ; Section 7.1
                CRLF
                [ Entity-Body ]          ; Section 7.2

Full-Response = Status-Line              ; Section 6.1
                *( General-Header        ; Section 4.3
                 | Response-Header       ; Section 6.2
                 | Entity-Header )       ; Section 7.1
                CRLF
                [ Entity-Body ]          ; Section 7.2



Simple-Request and Simple-Response do not allow the use of any header 
information and are limited to a single request method (GET). 

Simple-Request = "GET" SP Request-URI CRLF 

Simple-Response = [ Entity-Body ] 

Use of the Simple-Request format is discouraged because it prevents 
the server from identifying the media type of the returned entity. 

4.2 Message Headers 

HTTP header fields, which include General-Header (Section 4.3),
Request-Header (Section 5.2), Response-Header (Section 6.2), and
Entity-Header (Section 7.1) fields, follow the same generic format as
that given in Section 3.1 of RFC 822 [7]. Each header field consists
of a name followed immediately by a colon (":"), a single space (SP)
character, and the field value. Field names are case-insensitive.
Header fields can be extended over multiple lines by preceding each 
extra line with at least one SP or HT, though this is not 
recommended. 

HTTP-header   = field-name ":" [ field-value ] CRLF

field-name    = token
field-value   = *( field-content | LWS )

field-content = <the OCTETs making up the field-value
                and consisting of either TEXT or combinations
                of token, tspecials, and quoted-string>

The order in which header fields are received is not significant. 
However, it is "good practice" to send General-Header fields first,
followed by Request-Header or Response-Header fields prior to the
Entity-Header fields. 

Multiple HTTP-header fields with the same field-name may be present 
in a message if and only if the entire field-value for that header 
field is defined as a comma-separated list [i.e., #(values)]. It must 
be possible to combine the multiple header fields into one "field- 
name: field-value" pair, without changing the semantics of the 
message, by appending each subsequent field-value to the first, each 
separated by a comma. 
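
The header rules of this section (case-insensitive names, continuation lines, and comma-joining of repeated fields) can be sketched as follows. The names parse_headers and combine_fields are hypothetical, and the sketch assumes the header block has already been split into lines with the CRLF removed.

```python
def parse_headers(lines):
    """Turn header lines into a list of (lowercase-name, value) pairs."""
    fields = []
    for line in lines:
        # A line starting with SP or HT continues the previous field.
        if line[:1] in (" ", "\t") and fields:
            name, value = fields[-1]
            fields[-1] = (name, value + " " + line.strip())
            continue
        name, colon, value = line.partition(":")
        if not colon:
            raise ValueError(f"malformed header line: {line!r}")
        # Field names are case-insensitive, so fold them to lowercase.
        fields.append((name.strip().lower(), value.strip()))
    return fields


def combine_fields(fields):
    """Merge repeated fields into one value, separated by commas."""
    combined = {}
    for name, value in fields:
        if name in combined:
            combined[name] = combined[name] + ", " + value
        else:
            combined[name] = value
    return combined
```

Combining is only meaningful for fields whose value is defined as a comma-separated list; the sketch does not check that.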

4.3 General Header Fields

There are a few header fields which have general applicability for 
both request and response messages, but which do not apply to the 
entity being transferred. These headers apply only to the message 
being transmitted. 

General-Header = Date     ; Section 10.6
               | Pragma   ; Section 10.12

General header field names can be extended reliably only in 
combination with a change in the protocol version. However, new or 
experimental header fields may be given the semantics of general 
header fields if all parties in the communication recognize them to 
be general header fields. Unrecognized header fields are treated as 
Entity-Header fields. 

5. Request 

A request message from a client to a server includes, within the 
first line of that message, the method to be applied to the resource, 
the identifier of the resource, and the protocol version in use. For 
backwards compatibility with the more limited HTTP/0.9 protocol,
there are two valid formats for an HTTP request: 

Request        = Simple-Request | Full-Request

Simple-Request = "GET" SP Request-URI CRLF

Full-Request   = Request-Line             ; Section 5.1
                 *( General-Header        ; Section 4.3
                  | Request-Header        ; Section 5.2
                  | Entity-Header )       ; Section 7.1
                 CRLF
                 [ Entity-Body ]          ; Section 7.2

If an HTTP/1.0 server receives a Simple-Request, it must respond with 
an HTTP/0.9 Simple-Response. An HTTP/1.0 client capable of receiving
a Full-Response should never generate a Simple-Request. 

5.1 Request-Line

The Request-Line begins with a method token, followed by the 
Request-URI and the protocol version, and ending with CRLF. The 
elements are separated by SP characters. No CR or LF are allowed 
except in the final CRLF sequence. 

Request-Line = Method SP Request-URI SP HTTP-Version CRLF 

Note that the difference between a Simple-Request and the Request-
Line of a Full-Request is the presence of the HTTP-Version field and
the availability of methods other than GET.
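
A server can tell the two request formats apart from the first line alone. The following Python sketch illustrates one way to do so; the name parse_request_line is hypothetical, and the sketch assumes the line has been decoded to a string.

```python
def parse_request_line(line):
    """Return (method, request_uri, version).

    `version` is None for an HTTP/0.9 Simple-Request.
    """
    parts = line.rstrip("\r\n").split(" ")
    if len(parts) == 2 and parts[0] == "GET":
        # Simple-Request = "GET" SP Request-URI CRLF
        return "GET", parts[1], None
    if len(parts) == 3 and parts[2].startswith("HTTP/"):
        # Request-Line = Method SP Request-URI SP HTTP-Version CRLF
        return parts[0], parts[1], parts[2]
    raise ValueError(f"malformed request line: {line!r}")
```

A server receiving a result whose version is None would answer with a Simple-Response, without a Status-Line or headers.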

5.1.1 Method 

The Method token indicates the method to be performed on the resource 
identified by the Request-URI. The method is case-sensitive.

Method           = "GET"              ; Section 8.1
                 | "HEAD"             ; Section 8.2
                 | "POST"             ; Section 8.3
                 | extension-method

extension-method = token

The list of methods acceptable by a specific resource can change 
dynamically; the client is notified through the return code of the 
response if a method is not allowed on a resource. Servers should 
return the status code 501 (not implemented) if the method is 
unrecognized or not implemented. 

The methods commonly used by HTTP/1.0 applications are fully defined 
in Section 8. 

5.1.2 Request-URI

The Request-URI is a Uniform Resource Identifier (Section 3.2) and 
identifies the resource upon which to apply the request. 

Request-URI = absoluteURI | abs_path

The two options for Request-URI are dependent on the nature of the 
request. 

The absoluteURI form is only allowed when the request is being made 
to a proxy. The proxy is requested to forward the request and return 
the response. If the request is GET or HEAD and a prior response is 
cached, the proxy may use the cached message if it passes any 
restrictions in the Expires header field. Note that the proxy may 
forward the request on to another proxy or directly to the server 
specified by the absoluteURI. In order to avoid request loops, a 
proxy must be able to recognize all of its server names, including 
any aliases, local variations, and the numeric IP address. An example 
Request-Line would be:

GET http://www.w3.org/pub/WWW/TheProject.html HTTP/1.0 

The most common form of Request-URI is that used to identify a
resource on an origin server or gateway. In this case, only the 
absolute path of the URI is transmitted (see Section 3.2.1, 
abs_path). For example, a client wishing to retrieve the resource 
above directly from the origin server would create a TCP connection 
to port 80 of the host "www.w3.org" and send the line: 

GET /pub/WWW/TheProject.html HTTP/1.0 

followed by the remainder of the Full-Request. Note that the absolute 
path cannot be empty; if none is present in the original URI, it must 
be given as "/" (the server root).

The Request-URI is transmitted as an encoded string, where some 
characters may be escaped using the "% HEX HEX" encoding defined by 
RFC 1738 [4]. The origin server must decode the Request-URI in order 
to properly interpret the request. 
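
The choice between the two Request-URI forms, and the decoding step performed by the origin server, can be sketched as follows. The names request_uri_for and decode_request_uri are hypothetical; the decoder returns raw octets because this section does not say which character set the escaped octets represent.

```python
from urllib.parse import unquote_to_bytes


def request_uri_for(url, via_proxy=False):
    """Pick the Request-URI a client sends for an http URL."""
    prefix = "http://"
    if not url.startswith(prefix):
        raise ValueError(f"not an http URL: {url!r}")
    if via_proxy:
        return url                  # absoluteURI form, proxies only
    # Drop "host[:port]"; what remains after the first "/" is the path.
    _, _, rest = url[len(prefix):].partition("/")
    return "/" + rest               # an empty path is sent as "/"


def decode_request_uri(request_uri):
    """Undo the "% HEX HEX" escapes, returning the raw octets."""
    return unquote_to_bytes(request_uri)
```

For the example URL in this section, the first function yields the path sent to the origin server, or the full URL when a proxy is in use.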

5.2 Request Header Fields 

The request header fields allow the client to pass additional 
information about the request, and about the client itself, to the 
server. These fields act as request modifiers, with semantics 
equivalent to the parameters on a programming language method 
(procedure) invocation. 

Request-Header = Authorization       ; Section 10.2
               | From                ; Section 10.8
               | If-Modified-Since   ; Section 10.9
               | Referer             ; Section 10.13
               | User-Agent          ; Section 10.15

Request-Header field names can be extended reliably only in
combination with a change in the protocol version. However, new or 
experimental header fields may be given the semantics of request 
header fields if all parties in the communication recognize them to 
be request header fields. Unrecognized header fields are treated as 
Entity-Header fields. 

6. Response 

After receiving and interpreting a request message, a server responds 
in the form of an HTTP response message. 

Response        = Simple-Response | Full-Response

Simple-Response = [ Entity-Body ]

Full-Response   = Status-Line             ; Section 6.1
                  *( General-Header       ; Section 4.3
                   | Response-Header      ; Section 6.2
                   | Entity-Header )      ; Section 7.1
                  CRLF
                  [ Entity-Body ]         ; Section 7.2



A Simple-Response should only be sent in response to an HTTP/0.9 
Simple-Request or if the server only supports the more limited 
HTTP/0.9 protocol. If a client sends an HTTP/1.0 Full-Request and
receives a response that does not begin with a Status-Line, it should 
assume that the response is a Simple-Response and parse it 
accordingly. Note that the Simple-Response consists only of the 
entity body and is terminated by the server closing the connection. 

6.1 Status-Line

The first line of a Full-Response message is the Status-Line, 
consisting of the protocol version followed by a numeric status code 
and its associated textual phrase, with each element separated by SP 
characters. No CR or LF is allowed except in the final CRLF sequence. 

Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF 

Since a status line always begins with the protocol version and
status code

"HTTP/" 1*DIGIT "." 1*DIGIT SP 3DIGIT SP

(e.g., "HTTP/1.0 200 "), the presence of that expression is
sufficient to differentiate a Full-Response from a Simple-Response.
Although the Simple-Response format may allow such an expression to 
occur at the beginning of an entity body, and thus cause a 
misinterpretation of the message if it was given in response to a 
Full-Request, most HTTP/0.9 servers are limited to responses of type 
"text/html" and therefore would never generate such a response. 

6.1.1 Status Code and Reason Phrase 

The Status-Code element is a 3-digit integer result code of the 
attempt to understand and satisfy the request. The Reason-Phrase is 
intended to give a short textual description of the Status-Code. The 
Status-Code is intended for use by automata and the Reason-Phrase is 
intended for the human user. The client is not required to examine or 
display the Reason-Phrase. 
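
The Status-Line rules of Sections 6.1 and 6.1.1 can be sketched with two regular expressions. The names is_full_response and parse_status_line are hypothetical, and the sketch works on raw bytes as read from the connection.

```python
import re

# "HTTP/" 1*DIGIT "." 1*DIGIT SP 3DIGIT SP
_STATUS_PREFIX = re.compile(rb"HTTP/[0-9]+\.[0-9]+ [0-9]{3} ")
# Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF
_STATUS_LINE = re.compile(rb"(HTTP/[0-9]+\.[0-9]+) ([0-9]{3}) ([^\r\n]*)\r\n")


def is_full_response(data):
    """True if the response starts like a Status-Line."""
    return _STATUS_PREFIX.match(data) is not None


def parse_status_line(line):
    """Return (version, status_code, reason_phrase)."""
    match = _STATUS_LINE.fullmatch(line)
    if match is None:
        raise ValueError(f"malformed Status-Line: {line!r}")
    version, code, reason = match.groups()
    # The code is for automata; the phrase is informational only.
    return version.decode("ascii"), int(code), reason.decode("iso-8859-1")
```

A client would call is_full_response() on the first bytes received and fall back to treating the whole stream as a Simple-Response when it returns False.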

The first digit of the Status-Code defines the class of response. The
last two digits do not have any categorization role. There are 5
values for the first digit:

o 1xx: Informational - Not used, but reserved for future use

o 2xx: Success - The action was successfully received, 
understood, and accepted. 

o 3xx: Redirection - Further action must be taken in order to 
complete the request 

o 4xx: Client Error - The request contains bad syntax or cannot 
be fulfilled 

o 5xx: Server Error - The server failed to fulfill an apparently 
valid request 

The individual values of the numeric status codes defined for
HTTP/1.0, and an example set of corresponding Reason-Phrase's, are
presented below. The reason phrases listed here are only recommended
-- they may be replaced by local equivalents without affecting the
protocol. These codes are fully defined in Section 9.

Status-Code    = "200"   ; OK
               | "201"   ; Created
               | "202"   ; Accepted
               | "204"   ; No Content
               | "301"   ; Moved Permanently
               | "302"   ; Moved Temporarily
               | "304"   ; Not Modified
               | "400"   ; Bad Request
               | "401"   ; Unauthorized
               | "403"   ; Forbidden
               | "404"   ; Not Found
               | "500"   ; Internal Server Error
               | "501"   ; Not Implemented
               | "502"   ; Bad Gateway
               | "503"   ; Service Unavailable
               | extension-code

extension-code = 3DIGIT

Reason-Phrase  = *<TEXT, excluding CR, LF>

HTTP status codes are extensible, but the above codes are the only 
ones generally recognized in current practice. HTTP applications are 
not required to understand the meaning of all registered status 
codes, though such understanding is obviously desirable. However,
applications must understand the class of any status code, as
indicated by the first digit, and treat any unrecognized response as
being equivalent to the x00 status code of that class, with the
exception that an unrecognized response must not be cached. For 
example, if an unrecognized status code of 431 is received by the 
client, it can safely assume that there was something wrong with its 
request and treat the response as if it had received a 400 status 
code. In such cases, user agents should present to the user the 
entity returned with the response, since that entity is likely to 
include human-readable information which will explain the unusual 
status. 
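
The fallback rule for unrecognized status codes can be sketched as below. The names RECOGNIZED, STATUS_CLASSES, and interpret_status are hypothetical; the set of recognized codes is simply the list given in Section 6.1.1.

```python
# Codes listed in Section 6.1.1.
RECOGNIZED = frozenset({200, 201, 202, 204, 301, 302, 304,
                        400, 401, 403, 404, 500, 501, 502, 503})

STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def interpret_status(code):
    """Return (effective_code, class_name, recognized)."""
    status_class = code // 100      # the first digit defines the class
    if status_class not in STATUS_CLASSES:
        raise ValueError(f"invalid status code: {code}")
    if code in RECOGNIZED:
        return code, STATUS_CLASSES[status_class], True
    # Unrecognized: treat as the x00 code of its class.
    return status_class * 100, STATUS_CLASSES[status_class], False
```

When the last element is False, the caller must not cache the response and should show the returned entity to the user.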

6.2 Response Header Fields 

The response header fields allow the server to pass additional 
information about the response which cannot be placed in the Status- 
Line. These header fields give information about the server and about 
further access to the resource identified by the Request-URI. 

Response-Header = Location           ; Section 10.11
                | Server             ; Section 10.14
                | WWW-Authenticate   ; Section 10.16

Response-Header field names can be extended reliably only in 
combination with a change in the protocol version. However, new or 
experimental header fields may be given the semantics of response 
header fields if all parties in the communication recognize them to 
be response header fields. Unrecognized header fields are treated as 
Entity-Header fields. 

7. Entity 

Full-Request and Full-Response messages may transfer an entity within 
some requests and responses. An entity consists of Entity-Header 
fields and (usually) an Entity-Body. In this section, both sender and 
recipient refer to either the client or the server, depending on who 
sends and who receives the entity. 

7.1 Entity Header Fields

Entity-Header fields define optional metainformation about the
Entity-Body or, if no body is present, about the resource identified
by the request.

Entity-Header    = Allow              ; Section 10.1
                 | Content-Encoding   ; Section 10.3
                 | Content-Length     ; Section 10.4
                 | Content-Type       ; Section 10.5
                 | Expires            ; Section 10.7
                 | Last-Modified      ; Section 10.10
                 | extension-header

extension-header = HTTP-header

The extension-header mechanism allows additional Entity-Header fields
to be defined without changing the protocol, but these fields cannot
be assumed to be recognizable by the recipient. Unrecognized header
fields should be ignored by the recipient and forwarded by proxies.

7.2 Entity Body

The entity body (if any) sent with an HTTP request or response is in
a format and encoding defined by the Entity-Header fields.

Entity-Body = *OCTET

An entity body is included with a request message only when the
request method calls for one. The presence of an entity body in a
request is signaled by the inclusion of a Content-Length header field
in the request message headers. HTTP/1.0 requests containing an
entity body must include a valid Content-Length header field.

For response messages, whether or not an entity body is included with
a message is dependent on both the request method and the response
code. All responses to the HEAD request method must not include a
body, even though the presence of entity header fields may lead one
to believe they do. All 1xx (informational), 204 (no content), and
304 (not modified) responses must not include a body. All other
responses must include an entity body or a Content-Length header
field defined with a value of zero (0).

7.2.1 Type

When an Entity-Body is included with a message, the data type of that
body is determined via the header fields Content-Type and Content-
Encoding. These define a two-layer, ordered encoding model:

entity-body := Content-Encoding( Content-Type( data ) )

A Content-Type specifies the media type of the underlying data. A 
Content-Encoding may be used to indicate any additional content 
coding applied to the type, usually for the purpose of data 
compression, that is a property of the resource requested. The 
default for the content encoding is none (i.e., the identity 
function). 

Any HTTP/1.0 message containing an entity body should include a 
Content-Type header field defining the media type of that body. If 
and only if the media type is not given by a Content-Type header, as 
is the case for Simple-Response messages, the recipient may attempt 
to guess the media type via inspection of its content and/or the name 
extension(s) of the URL used to identify the resource. If the media
type remains unknown, the recipient should treat it as type
"application/octet-stream".
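
The fallback order described above might look like the following sketch. The function name effective_media_type is hypothetical, and only the name-extension guess is shown (content inspection is left out); the guess relies on Python's standard mimetypes table, which is an assumption of this example rather than anything the specification prescribes.

```python
import mimetypes
from typing import Optional


def effective_media_type(content_type: Optional[str], url: str) -> str:
    """Pick the media type for an entity body, following the order above."""
    # A Content-Type header, when present, settles the question.
    if content_type:
        return content_type
    # Otherwise the recipient may guess from the URL's name extension.
    guessed, _encoding = mimetypes.guess_type(url)
    # If the type is still unknown, fall back to the generic binary type.
    return guessed or "application/octet-stream"
```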

7.2.2 Length 

When an Entity-Body is included with a message, the length of that 
body may be determined in one of two ways. If a Content-Length header
field is present, its value in bytes represents the length of the 
Entity-Body. Otherwise, the body length is determined by the closing 
of the connection by the server. 

Closing the connection cannot be used to indicate the end of a 
request body, since it leaves no possibility for the server to send 
back a response. Therefore, HTTP/1.0 requests containing an entity 
body must include a valid Content-Length header field. If a request
contains an entity body and Content-Length is not specified, and the
server does not recognize or cannot calculate the length from other 
fields, then the server should send a 400 (bad request) response. 

Note: Some older servers supply an invalid Content-Length when
sending a document that contains server-side includes dynamically 
inserted into the data stream. It must be emphasized that this 
will not be tolerated by future versions of HTTP. Unless the 
client knows that it is receiving a response from a compliant 
server, it should not depend on the Content-Length value being 
correct. 
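
As a rough illustration of the two ways of finding the body length, the sketch below reads a body from a file-like stream. The names read_entity_body and BadRequest are hypothetical, header names are assumed to be already normalized to "Content-Length", and an in-memory stream stands in for a real connection.

```python
import io
import re
from typing import BinaryIO, Mapping


class BadRequest(Exception):
    """Stands in for answering with a 400 (bad request) response."""


def read_entity_body(stream: BinaryIO, headers: Mapping[str, str],
                     is_request: bool) -> bytes:
    """Read an entity body using Content-Length or, for responses, end of stream."""
    raw = headers.get("Content-Length", "").strip()
    if re.fullmatch(r"[0-9]+", raw):
        # A valid Content-Length gives the body length in bytes.
        return stream.read(int(raw))
    if is_request:
        # A request cannot signal its end by closing the connection.
        raise BadRequest("request body without a valid Content-Length")
    # Response without Content-Length: the body runs until the server closes.
    return stream.read()


# Usage: only the first five bytes belong to the body.
demo = read_entity_body(io.BytesIO(b"hello, world"), {"Content-Length": "5"}, True)
```

In line with the note above, a cautious client might still treat a response Content-Length from an unknown server as a hint rather than a guarantee.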

8. Method Definitions 

The set of common methods for HTTP/1.0 is defined below. Although 
this set can be expanded, additional methods cannot be assumed to 
share the same semantics for separately extended clients and servers.

8.1 GET

The GET method means retrieve whatever information (in the form of an
entity) is identified by the Request-URI. If the Request-URI refers
to a data-producing process, it is the produced data which shall be 
returned as the entity in the response and not the source text of the 
process, unless that text happens to be the output of the process. 

The semantics of the GET method changes to a "conditional GET" if the 
request message includes an If-Modified-Since header field. A
conditional GET method requests that the identified resource be
transferred only if it has been modified since the date given by the
If-Modified-Since header, as described in Section 10.9. The
conditional GET method is intended to reduce network usage by 
allowing cached entities to be refreshed without requiring multiple 
requests or transferring unnecessary data. 

8.2 HEAD

The HEAD method is identical to GET except that the server must not
return any Entity-Body in the response. The metainformation contained
in the HTTP headers in response to a HEAD request should be identical
to the information sent in response to a GET request. This method can
be used for obtaining metainformation about the resource identified
by the Request-URI without transferring the Entity-Body itself. This
method is often used for testing hypertext links for validity, 
accessibility, and recent modification. 

There is no "conditional HEAD" request analogous to the conditional 
GET. If an If-Modified-Since header field is included with a HEAD
request, it should be ignored. 
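
One simple way to honor the HEAD rules is to build the response a GET would have produced and then drop the body. The sketch below assumes a response is represented as a (status, headers, body) tuple; that representation and the name head_from_get are hypothetical.

```python
from typing import Dict, Tuple

Response = Tuple[int, Dict[str, str], bytes]


def head_from_get(status: int, headers: Dict[str, str], body: bytes) -> Response:
    """Derive a HEAD response from the response a GET would have produced."""
    # Headers are copied unchanged, so Content-Length and Content-Type still
    # describe the body a GET would have transferred; only the body is dropped.
    return status, dict(headers), b""
```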

8.3 POST 

The POST method is used to request that the destination server accept 
the entity enclosed in the request as a new subordinate of the 
resource identified by the Request-URI in the Request-Line. POST is 
designed to allow a uniform method to cover the following functions: 

o Annotation of existing resources; 

o Posting a message to a bulletin board, newsgroup, mailing list, 
or similar group of articles; 

o Providing a block of data, such as the result of submitting a 
form [3], to a data-handling process; 

o Extending a database through an append operation.

The actual function performed by the POST method is determined by the
server and is usually dependent on the Request-URI. The posted entity
is subordinate to that URI in the same way that a file is subordinate 
to a directory containing it, a news article is subordinate to a 
newsgroup to which it is posted, or a record is subordinate to a 
database. 

A successful POST does not require that the entity be created as a 
resource on the origin server or made accessible for future 
reference. That is, the action performed by the POST method might not 
result in a resource that can be identified by a URI. In this case, 
either 200 (ok) or 204 (no content) is the appropriate response 
status, depending on whether or not the response includes an entity 
that describes the result. 

If a resource has been created on the origin server, the response 
should be 201 (created) and contain an entity (preferably of type 
"text/html") which describes the status of the request and refers to 
the new resource. 

A valid Content-Length is required on all HTTP/1.0 POST requests. An 
HTTP/1.0 server should respond with a 400 (bad request) message if it 
cannot determine the length of the request message's content. 

Applications must not cache responses to a POST request because the 
application has no way of knowing that the server would return an 
equivalent response on some future request. 
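
The choice of status code for a POST, as described in this section, can be summarized in a short sketch. The function and parameter names are hypothetical, and the decision about whether a resource was created is assumed to have been made elsewhere by the server.

```python
from typing import Optional


def post_response_status(content_length: Optional[int],
                         created_resource: bool,
                         has_result_entity: bool) -> int:
    """Pick a status code for a POST according to the rules of this section."""
    # The server must be able to determine the length of the request content.
    if content_length is None:
        return 400  # bad request
    # A resource created on the origin server is reported with 201.
    if created_resource:
        return 201  # created
    # Otherwise 200 or 204, depending on whether a result entity is returned.
    return 200 if has_result_entity else 204
```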

9. Status Code Definitions 

Each Status-Code is described below, including a description of which 
method(s) it can follow and any metainformation required in the
response. 

9.1 Informational 1xx

This class of status code indicates a provisional response,
consisting only of the Status-Line and optional headers, and is
terminated by an empty line. HTTP/1.0 does not define any 1xx status
codes and they are not a valid response to a HTTP/1.0 request. 
However, they may be useful for experimental applications which are 
outside the scope of this specification. 

9.2 Successful 2xx 

This class of status code indicates that the client's request was 
successfully received, understood, and accepted.

200 OK

The request has succeeded. The information returned with the 
response is dependent on the method used in the request, as follows: 

GET an entity corresponding to the requested resource is sent 
in the response; 

HEAD the response must only contain the header information and 
no Entity-Body; 

POST an entity describing or containing the result of the action. 

201 Created 

The request has been fulfilled and resulted in a new resource being 
created. The newly created resource can be referenced by the URI(s)
returned in the entity of the response. The origin server should 
create the resource before using this Status-Code. If the action 
cannot be carried out immediately, the server must include in the 
response body a description of when the resource will be available; 
otherwise, the server should respond with 202 (accepted). 

Of the methods defined by this specification, only POST can create a 
resource. 

202 Accepted 

The request has been accepted for processing, but the processing 
has not been completed. The request may or may not eventually be 
acted upon, as it may be disallowed when processing actually takes 
place. There is no facility for re-sending a status code from an 
asynchronous operation such as this. 

The 202 response is intentionally non-committal. Its purpose is to
allow a server to accept a request for some other process (perhaps 
a batch-oriented process that is only run once per day) without 
requiring that the user agent's connection to the server persist 
until the process is completed. The entity returned with this 
response should include an indication of the request's current 
status and either a pointer to a status monitor or some estimate of 
when the user can expect the request to be fulfilled. 

204 No Content 

The server has fulfilled the request but there is no new 
information to send back. If the client is a user agent, it should 
not change its document view from that which caused the request to
be generated. This response is primarily intended to allow input
for scripts or other actions to take place without causing a change
to the user agent's active document view. The response may include
new metainformation in the form of entity headers, which should
apply to the document currently in the user agent's active view. 

9.3 Redirection 3xx 

This class of status code indicates that further action needs to be 
taken by the user agent in order to fulfill the request. The action 
required may be carried out by the user agent without interaction 
with the user if and only if the method used in the subsequent 
request is GET or HEAD. A user agent should never automatically 
redirect a request more than 5 times, since such redirections usually 
indicate an infinite loop. 
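
A user agent might guard automatic redirection with a check along these lines. This is a minimal sketch; the names are hypothetical, and the counter of redirects already followed is assumed to be tracked by the caller.

```python
MAX_AUTOMATIC_REDIRECTS = 5


def may_redirect_automatically(next_method: str, redirects_followed: int) -> bool:
    """Decide whether a 3xx response may be followed without asking the user."""
    # Only GET and HEAD may be redirected without user interaction.
    if next_method not in ("GET", "HEAD"):
        return False
    # More than five automatic redirections usually indicates a loop.
    return redirects_followed < MAX_AUTOMATIC_REDIRECTS
```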

300 Multiple Choices 

This response code is not directly used by HTTP/1.0 applications, 
but serves as the default for interpreting the 3xx class of 
responses. 

The requested resource is available at one or more locations. 
Unless it was a HEAD request, the response should include an entity 
containing a list of resource characteristics and locations from 
which the user or user agent can choose the one most appropriate. 
If the server has a preferred choice, it should include the URL in 
a Location field; user agents may use this field value for 
automatic redirection. 

301 Moved Permanently 

The requested resource has been assigned a new permanent URL and 
any future references to this resource should be done using that 
URL. Clients with link editing capabilities should automatically 
relink references to the Request-URI to the new reference returned 
by the server, where possible. 

The new URL must be given by the Location field in the response. 
Unless it was a HEAD request, the Entity-Body of the response 
should contain a short note with a hyperlink to the new URL. 

If the 301 status code is received in response to a request using 
the POST method, the user agent must not automatically redirect the 
request unless it can be confirmed by the user, since this might 
change the conditions under which the request was issued.

Note: When automatically redirecting a POST request after
receiving a 301 status code, some existing user agents will 
erroneously change it into a GET request. 

302 Moved Temporarily 

The requested resource resides temporarily under a different URL. 
Since the redirection may be altered on occasion, the client should 
continue to use the Request-URI for future requests. 

The URL must be given by the Location field in the response. Unless 
it was a HEAD request, the Entity-Body of the response should 
contain a short note with a hyperlink to the new URI(s). 

If the 302 status code is received in response to a request using 
the POST method, the user agent must not automatically redirect the 
request unless it can be confirmed by the user, since this might 
change the conditions under which the request was issued. 

Note: When automatically redirecting a POST request after 
receiving a 302 status code, some existing user agents will 
erroneously change it into a GET request. 

304 Not Modified 

If the client has performed a conditional GET request and access is 
allowed, but the document has not been modified since the date and 
time specified in the If-Modified-Since field, the server must
respond with this status code and not send an Entity-Body to the 
client. Header fields contained in the response should only include 
information which is relevant to cache managers or which may have 
changed independently of the entity's Last-Modified date. Examples 
of relevant header fields include: Date, Server, and Expires. A 
cache should update its cached entity to reflect any new field 
values given in the 304 response. 

9.4 Client Error 4xx 

The 4xx class of status code is intended for cases in which the 
client seems to have erred. If the client has not completed the 
request when a 4xx code is received, it should immediately cease 
sending data to the server. Except when responding to a HEAD request, 
the server should include an entity containing an explanation of the 
error situation, and whether it is a temporary or permanent 
condition. These status codes are applicable to any request method.

Note: If the client is sending data, server implementations on TCP
should be careful to ensure that the client acknowledges receipt
of the packet(s) containing the response prior to closing the
input connection. If the client continues sending data to the
server after the close, the server's controller will send a reset
packet to the client, which may erase the client's unacknowledged
input buffers before they can be read and interpreted by the HTTP
application.

400 Bad Request 

The request could not be understood by the server due to malformed 
syntax. The client should not repeat the request without 
modifications. 

401 Unauthorized 

The request requires user authentication. The response must include 
a WWW-Authenticate header field (Section 10.16) containing a 
challenge applicable to the requested resource. The client may 
repeat the request with a suitable Authorization header field 
(Section 10.2). If the request already included Authorization 
credentials, then the 401 response indicates that authorization has 
been refused for those credentials. If the 401 response contains 
the same challenge as the prior response, and the user agent has 
already attempted authentication at least once, then the user 
should be presented the entity that was given in the response, 
since that entity may include relevant diagnostic information. HTTP 
access authentication is explained in Section 11. 

403 Forbidden 

The server understood the request, but is refusing to fulfill it. 
Authorization will not help and the request should not be repeated. 
If the request method was not HEAD and the server wishes to make 
public why the request has not been fulfilled, it should describe 
the reason for the refusal in the entity body. This status code is 
commonly used when the server does not wish to reveal exactly why 
the request has been refused, or when no other response is 
applicable. 

404 Not Found 

The server has not found anything matching the Request-URI. No 
indication is given of whether the condition is temporary or 
permanent. If the server does not wish to make this information 
available to the client, the status code 403 (forbidden) can be 
used instead.

9.5 Server Error 5xx

Response status codes beginning with the digit "5" indicate cases in 
which the server is aware that it has erred or is incapable of 
performing the request. If the client has not completed the request 
when a 5xx code is received, it should immediately cease sending data 
to the server. Except when responding to a HEAD request, the server 
should include an entity containing an explanation of the error 
situation, and whether it is a temporary or permanent condition. 
These response codes are applicable to any request method and there 
are no required header fields. 

500 Internal Server Error 

The server encountered an unexpected condition which prevented it 
from fulfilling the request. 

501 Not Implemented 

The server does not support the functionality required to fulfill 
the request. This is the appropriate response when the server does 
not recognize the request method and is not capable of supporting 
it for any resource. 

502 Bad Gateway 

The server, while acting as a gateway or proxy, received an invalid 
response from the upstream server it accessed in attempting to 
fulfill the request. 

503 Service Unavailable 

The server is currently unable to handle the request due to a 
temporary overloading or maintenance of the server. The implication 
is that this is a temporary condition which will be alleviated 
after some delay. 

Note: The existence of the 503 status code does not imply 
that a server must use it when becoming overloaded. Some 
servers may wish to simply refuse the connection. 

10. Header Field Definitions 

This section defines the syntax and semantics of all commonly used 
HTTP/1.0 header fields. For general and entity header fields, both 
sender and recipient refer to either the client or the server, 
depending on who sends and who receives the message.

10.1 Allow

The Allow entity-header field lists the set of methods supported by 
the resource identified by the Request-URI. The purpose of this field 
is strictly to inform the recipient of valid methods associated with 
the resource. The Allow header field is not permitted in a request 
using the POST method, and thus should be ignored if it is received 
as part of a POST entity. 

Allow = "Allow" ":" 1#method

Example of use: 

Allow: GET, HEAD 

This field cannot prevent a client from trying other methods. 
However, the indications given by the Allow header field value should 
be followed. The actual set of allowed methods is defined by the 
origin server at the time of each request. 

A proxy must not modify the Allow header field even if it does not 
understand all the methods specified, since the user agent may have 
other means of communicating with the origin server. 

The Allow header field does not indicate what methods are implemented 
by the server. 
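
A recipient could split an Allow field value into its methods roughly as follows. This is a sketch only; parse_allow is a hypothetical name, and skipping empty list elements is a leniency assumed by this example.

```python
from typing import List


def parse_allow(field_value: str) -> List[str]:
    """Split an Allow field value such as "GET, HEAD" into method names."""
    methods = [item.strip() for item in field_value.split(",")]
    # Assumption: empty elements between commas are skipped, not rejected.
    methods = [method for method in methods if method]
    # The grammar 1#method requires at least one method.
    if not methods:
        raise ValueError("Allow requires at least one method")
    return methods
```

Since the field is informational, a client would normally use the result to decide what to offer the user, not to enforce anything.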

10.2 Authorization 

A user agent that wishes to authenticate itself with a server — 
usually, but not necessarily, after receiving a 401 response — may do 
so by including an Authorization request-header field with the
request. The Authorization field value consists of credentials 
containing the authentication information of the user agent for the 
realm of the resource being requested. 

Authorization = "Authorization" ":" credentials 

HTTP access authentication is described in Section 11. If a request 
is authenticated and a realm specified, the same credentials should 
be valid for all other requests within this realm. 

Responses to requests containing an Authorization field are not 
cachable.

10.3 Content-Encoding

The Content-Encoding entity-header field is used as a modifier to the 
media-type. When present, its value indicates what additional content 
coding has been applied to the resource, and thus what decoding 
mechanism must be applied in order to obtain the media-type 
referenced by the Content-Type header field. The Content-Encoding is
primarily used to allow a document to be compressed without losing
the identity of its underlying media type.

Content-Encoding = "Content-Encoding" ":" content-coding

Content codings are defined in Section 3.5. An example of its use is

Content-Encoding: x-gzip

The Content-Encoding is a characteristic of the resource identified
by the Request-URI. Typically, the resource is stored with this 
encoding and is only decoded before rendering or analogous usage. 
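
To make the decoding step concrete, the sketch below undoes an "x-gzip" content coding so that the result can then be interpreted according to Content-Type. The function name decode_entity_body is hypothetical, only the coding used in the example above is handled, and treating the coding name as case-insensitive is an assumption of this sketch.

```python
import gzip
from typing import Optional


def decode_entity_body(body: bytes, content_encoding: Optional[str]) -> bytes:
    """Undo the content coding so the result matches the Content-Type."""
    # With no Content-Encoding, the coding is the identity function.
    if content_encoding is None:
        return body
    # Assumption: the coding name is compared without regard to case.
    coding = content_encoding.strip().lower()
    if coding == "x-gzip":
        return gzip.decompress(body)
    # Other codings are outside the scope of this sketch.
    raise ValueError("unsupported content coding: " + coding)
```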

10.4 Content-Length

The Content-Length entity-header field indicates the size of the 
Entity-Body, in decimal number of octets, sent to the recipient or, 
in the case of the HEAD method, the size of the Entity-Body that 
would have been sent had the request been a GET. 

Content-Length = "Content-Length" ":" 1*DIGIT

An example is

Content-Length: 3495

Applications should use this field to indicate the size of the
Entity-Body to be transferred, regardless of the media type of the
entity. A valid Content-Length field value is required on all
HTTP/1.0 request messages containing an entity body.

Any Content-Length greater than or equal to zero is a valid value.
Section 7.2.2 describes how to determine the length of a response 
entity body if a Content-Length is not given. 
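
Validating the field value against 1*DIGIT could be done as in this sketch. The name parse_content_length is hypothetical; returning None for an invalid value is a choice made for this example, leaving the caller to decide how to react.

```python
import re
from typing import Optional


def parse_content_length(field_value: str) -> Optional[int]:
    """Return the length in octets, or None if the value is not 1*DIGIT."""
    value = field_value.strip()
    # Only one or more ASCII digits are accepted, so signs and units fail.
    if re.fullmatch(r"[0-9]+", value) is None:
        return None
    return int(value)
```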

Note: The meaning of this field is significantly different from 
the corresponding definition in MIME, where it is an optional 
field used within the "message/external-body" content-type. In
HTTP, it should be used whenever the entity's length can be
determined prior to being transferred.

10.5 Content-Type

The Content-Type entity-header field indicates the media type of the 
Entity-Body sent to the recipient or, in the case of the HEAD method, 
the media type that would have been sent had the request been a GET. 

Content-Type = "Content-Type" ":" media-type 

Media types are defined in Section 3.6. An example of the field is 

Content-Type: text/html 

Further discussion of methods for identifying the media type of an
entity is provided in Section 7.2.1. 

10.6 Date 

The Date general-header field represents the date and time at which
the message was originated, having the same semantics as orig-date in
RFC 822. The field value is an HTTP-date, as described in Section
3.3. 

Date = "Date" ":" HTTP-date 

An example is 

Date: Tue, 15 Nov 1994 08:12:31 GMT 

If a message is received via direct connection with the user agent 
(in the case of requests) or the origin server (in the case of 
responses), then the date can be assumed to be the current date at 
the receiving end. However, since the date — as it is believed by the 
origin — is important for evaluating cached responses, origin servers 
should always include a Date header. Clients should only send a Date 
header field in messages that include an entity body, as in the case 
of the POST request, and even then it is optional. A received message 
which does not have a Date header field should be assigned one by the 
recipient if the message will be cached by that recipient or 
gatewayed via a protocol which requires a Date. 

In theory, the date should represent the moment just before the 
entity is generated. In practice, the date can be generated at any 
time during the message origination without affecting its semantic 
value. 
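
Producing and reading a Date value in the form shown in the example above can be sketched with Python's standard email.utils helpers. The function names are hypothetical, and this sketch does not claim to cover every HTTP-date variant allowed by Section 3.3.

```python
import calendar
from email.utils import formatdate, mktime_tz, parsedate_tz
from typing import Optional


def format_http_date(timestamp: float) -> str:
    """Format seconds since the epoch as a date in GMT."""
    return formatdate(timestamp, usegmt=True)


def parse_http_date(value: str) -> Optional[int]:
    """Return seconds since the epoch, or None if the value cannot be parsed."""
    parsed = parsedate_tz(value)
    return None if parsed is None else mktime_tz(parsed)


# The moment used in the example above, expressed as a UTC timestamp.
example_timestamp = calendar.timegm((1994, 11, 15, 8, 12, 31))
```

A recipient that must assign a missing Date, as described above, could call format_http_date with its own current time.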

Note: An earlier version of this document incorrectly specified 
that this field should contain the creation date of the enclosed 
Entity-Body. This has been changed to reflect actual (and proper)
usage.

10.7 Expires

The Expires entity-header field gives the date/time after which the 
entity should be considered stale. This allows information providers 
to suggest the volatility of the resource, or a date after which the 
information may no longer be valid. Applications must not cache this 
entity beyond the date given. The presence of an Expires field does 
not imply that the original resource will change or cease to exist 
at, before, or after that time. However, information providers that 
know or even suspect that a resource will change by a certain date 
should include an Expires header with that date. The format is an 
absolute date and time as defined by HTTP-date in Section 3.3. 

Expires = "Expires" ":" HTTP-date 

An example of its use is 

Expires: Thu, 01 Dec 1994 16:00:00 GMT 

If the date given is equal to or earlier than the value of the Date 
header, the recipient must not cache the enclosed entity. If a 
resource is dynamic by nature, as is the case with many data- 
producing processes, entities from that resource should be given an 
appropriate Expires value which reflects that dynamism. 
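
The comparison between Expires and Date might be implemented along the lines of this sketch. The function names are hypothetical, and treating an unparseable value as already expired is a conservative assumption made here.

```python
from email.utils import mktime_tz, parsedate_tz
from typing import Optional


def _to_timestamp(value: str) -> Optional[int]:
    """Parse a date header value into seconds since the epoch, or None."""
    parsed = parsedate_tz(value)
    return None if parsed is None else mktime_tz(parsed)


def may_cache(expires_value: str, date_value: str) -> bool:
    """Return True only if Expires is strictly later than Date."""
    expires = _to_timestamp(expires_value)
    date = _to_timestamp(date_value)
    # Assumption: a value that cannot be parsed counts as already expired.
    if expires is None or date is None:
        return False
    # Equal to or earlier than Date means the entity must not be cached.
    return expires > date
```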

The Expires field cannot be used to force a user agent to refresh its 
display or reload a resource; its semantics apply only to caching 
mechanisms, and such mechanisms need only check a resource's 
expiration status when a new request for that resource is initiated. 

User agents often have history mechanisms, such as "Back" buttons and 
history lists, which can be used to redisplay an entity retrieved 
earlier in a session. By default, the Expires field does not apply to 
history mechanisms. If the entity is still in storage, a history 
mechanism should display it even if the entity has expired, unless 
the user has specifically configured the agent to refresh expired 
history documents. 

Note: Applications are encouraged to be tolerant of bad or 
misinformed implementations of the Expires header. A value of zero 
(0) or an invalid date format should be considered equivalent to 
an "expires immediately." Although these values are not legitimate 
for HTTP/1.0, a robust implementation is always desirable.

10.8 From

The From request-header field, if given, should contain an Internet
e-mail address for the human user who controls the requesting user 
agent. The address should be machine-usable, as defined by mailbox in 
RFC 822 [7] (as updated by RFC 1123 [6]):

From = "From" ":" mailbox

An example is:

From: webmaster@w3.org

This header field may be used for logging purposes and as a means for
identifying the source of invalid or unwanted requests. It should not
be used as an insecure form of access protection. The interpretation
of this field is that the request is being performed on behalf of the
person given, who accepts responsibility for the method performed. In
particular, robot agents should include this header so that the
person responsible for running the robot can be contacted if problems
occur on the receiving end.

The Internet e-mail address in this field may be separate from the
Internet host which issued the request. For example, when a request
is passed through a proxy, the original issuer's address should be
used.

Note: The client should not send the From header field without the
user's approval, as it may conflict with the user's privacy
interests or their site's security policy. It is strongly
recommended that the user be able to disable, enable, and modify
the value of this field at any time prior to a request.

10.9 If-Modified-Since

The If-Modified-Since request-header field is used with the GET
method to make it conditional: if the requested resource has not been
modified since the time specified in this field, a copy of the
resource will not be returned from the server; instead, a 304 (not
modified) response will be returned without any Entity-Body.

If-Modified-Since = "If-Modified-Since" ":" HTTP-date

An example of the field is:

If-Modified-Since: Sat, 29 Oct 1994 19:43:31 GMT

A conditional GET method requests that the identified resource be
transferred only if it has been modified since the date given by the
If-Modified-Since header. The algorithm for determining this includes
the following cases: 

a) If the request would normally result in anything other than
a 200 (ok) status, or if the passed If-Modified-Since date
is invalid, the response is exactly the same as for a
normal GET. A date which is later than the server's current
time is invalid.

b) If the resource has been modified since the
If-Modified-Since date, the response is exactly the same as
for a normal GET.

c) If the resource has not been modified since a valid
If-Modified-Since date, the server shall return a 304 (not
modified) response.

The purpose of this feature is to allow efficient updates of cached 
information with a minimum amount of transaction overhead. 
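
Cases a) through c) above translate fairly directly into code. The following is a sketch under stated assumptions: the function name is hypothetical, times are passed as seconds since the epoch, and the date is parsed with Python's email.utils helpers.

```python
from email.utils import mktime_tz, parsedate_tz


def conditional_get_status(normal_status: int, if_modified_since: str,
                           last_modified: int, now: int) -> int:
    """Return the status code for a GET carrying If-Modified-Since."""
    # Case a: anything other than 200 is answered as a normal GET.
    if normal_status != 200:
        return normal_status
    parsed = parsedate_tz(if_modified_since)
    # Case a: an unparseable date is invalid, so answer as a normal GET.
    if parsed is None:
        return normal_status
    since = mktime_tz(parsed)
    # Case a: a date later than the server's current time is invalid.
    if since > now:
        return normal_status
    # Case b: the resource was modified after the given date.
    if last_modified > since:
        return 200
    # Case c: not modified since a valid date.
    return 304
```

When 304 is returned, the server would send headers only, in line with the description of that status code in Section 9.3.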

10.10 Last-Modified

The Last-Modified entity-header field indicates the date and time at 
which the sender believes the resource was last modified. The exact 
semantics of this field are defined in terms of how the recipient 
should interpret it: if the recipient has a copy of this resource 
which is older than the date given by the Last-Modified field, that 
copy should be considered stale. 

Last-Modified = "Last-Modified" ":" HTTP-date 

An example of its use is 

Last-Modified: Tue, 15 Nov 1994 12:45:26 GMT 

The exact meaning of this header field depends on the implementation 
of the sender and the nature of the original resource. For files, it 
may be just the file system last-modified time. For entities with 
dynamically included parts, it may be the most recent of the set of 
last-modify times for its component parts. For database gateways, it 
may be the last-update timestamp of the record. For virtual objects, 
it may be the last time the internal state changed. 

An origin server must not send a Last-Modified date which is later 
than the server's time of message origination. In such cases, where 
the resource's last modification would indicate some time in the
future, the server must replace that date with the message
origination date. 
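
The replacement rule for future dates amounts to taking the earlier of two times, as in this small sketch (the function name is hypothetical, and both times are assumed to be seconds since the epoch):

```python
def last_modified_to_send(resource_mtime: float, origination_time: float) -> float:
    """Never report a Last-Modified time later than message origination."""
    # A modification time in the future is replaced by the origination time.
    return min(resource_mtime, origination_time)
```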

10.11 Location 

The Location response-header field defines the exact location of the 
resource that was identified by the Request-URI. For 3xx responses, 
the location must indicate the server's preferred URL for automatic 
redirection to the resource. Only one absolute URL is allowed. 

Location = "Location" ":" absoluteURI 

An example is 

Location: http://www.w3.org/hypertext/WWW/NewLocation.html

10.12 Pragma

The Pragma general-header field is used to include implementation-
specific directives that may apply to any recipient along the
request/response chain. All pragma directives specify optional
behavior from the viewpoint of the protocol; however, some systems
may require that behavior be consistent with the directives.

Pragma = "Pragma" ":" 1#pragma-directive

pragma-directive = "no-cache" | extension-pragma
extension-pragma = token [ "=" word ] 

When the "no-cache" directive is present in a request message, an 
application should forward the request toward the origin server even 
if it has a cached copy of what is being requested. This allows a 
client to insist upon receiving an authoritative response to its 
request. It also allows a client to refresh a cached copy which is 
known to be corrupted or stale. 

Pragma directives must be passed through by a proxy or gateway 
application, regardless of their significance to that application, 
since the directives may be applicable to all recipients along the 
request/response chain. It is not possible to specify a pragma for a 
specific recipient; however, any pragma directive not relevant to a 
recipient should be ignored by that recipient. 
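
A cache or proxy could inspect the Pragma field roughly as follows. This is a simplified sketch: the function names are hypothetical, directive names are compared case-insensitively as an assumption, and quoted values containing commas are not handled.

```python
from typing import Dict, Optional


def parse_pragma(field_value: str) -> Dict[str, Optional[str]]:
    """Parse a Pragma field value into a mapping of directive to value."""
    directives: Dict[str, Optional[str]] = {}
    for item in field_value.split(","):
        item = item.strip()
        if not item:
            continue
        # extension-pragma = token [ "=" word ]
        name, separator, word = item.partition("=")
        directives[name.strip().lower()] = word.strip() if separator else None
    return directives


def must_bypass_cache(field_value: str) -> bool:
    """True when the request asks to be forwarded toward the origin server."""
    return "no-cache" in parse_pragma(field_value)
```

The original field should still be forwarded unchanged, since directives must be passed through regardless of their significance to the application.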

10.13 Referer

The Referer request-header field allows the client to specify, for
the server's benefit, the address (URI) of the resource from which
the Request-URI was obtained. This allows a server to generate lists
of back-links to resources for interest, logging, optimized caching,
etc. It also allows obsolete or mistyped links to be traced for
maintenance. The Referer field must not be sent if the Request-URI
was obtained from a source that does not have its own URI, such as
input from the user keyboard.

Referer = "Referer" ":" ( absoluteURI | relativeURI )

Example:

Referer: http://www.w3.org/hypertext/DataSources/Overview.html

If a partial URI is given, it should be interpreted relative to the 
Request-URI. The URI must not include a fragment. 
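
A server that logs back-links might resolve a partial Referer value as in the sketch below. The function name is hypothetical, the Request-URI is assumed to be available in absolute form, and dropping a stray fragment instead of rejecting the value is a leniency chosen for this example.

```python
from urllib.parse import urldefrag, urljoin


def resolve_referer(request_uri: str, referer_value: str) -> str:
    """Interpret a Referer value relative to the Request-URI."""
    # A partial URI is resolved against the Request-URI; an absolute
    # URI passes through urljoin unchanged.
    absolute = urljoin(request_uri, referer_value)
    # A fragment is not allowed; this sketch drops one if it is present.
    without_fragment, _fragment = urldefrag(absolute)
    return without_fragment
```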

Note: Because the source of a link may be private information or 
may reveal an otherwise private information source, it is strongly 
recommended that the user be able to select whether or not the 
Referer field is sent. For example, a browser client could have a 
toggle switch for browsing openly/anonymously, which would 
respectively enable/disable the sending of Referer and From 
information. 

10. 14 Server 

The Server response-header field contains information about the 
software used by the origin server to handle the request. The field 
can contain multiple product tokens (Section 3.7) and comments 
identifying the server and any significant subproducts. By- 
convent ion, the product tokens are listed in order of their 
significance for identifying the application. 

Server         = "Server" ":" 1*( product | comment )

Example: 

Server: CERN/3.0 libwww/2.17

If the response is being forwarded through a proxy, the proxy 
application must not add its data to the product list. 

Note: Revealing the specific software version of the server may 
allow the server machine to become more vulnerable to attacks 
against software that is known to contain security holes. Server 
implementors are encouraged to make this field a configurable
option.

Note: Some existing servers fail to restrict themselves to the
product token syntax within the Server field. 

10.15 User-Agent

The User-Agent request-header field contains information about the
user agent originating the request. This is for statistical purposes, 
the tracing of protocol violations, and automated recognition of user 
agents for the sake of tailoring responses to avoid particular user 
agent limitations. Although it is not required, user agents should 
include this field with requests. The field can contain multiple 
product tokens (Section 3.7) and comments identifying the agent and 
any subproducts which form a significant part of the user agent. By 
convention, the product tokens are listed in order of their 
significance for identifying the application. 

User-Agent     = "User-Agent" ":" 1*( product | comment )

Example:

User-Agent: CERN-LineMode/2.15 libwww/2.17b3

Note: Some current proxy applications append their product 
information to the list in the User-Agent field. This is not 
recommended, since it makes machine interpretation of these 
fields ambiguous. 

Note: Some existing clients fail to restrict themselves to
the product token syntax within the User-Agent field. 
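
As an illustration of how a recipient might read a Server or User-Agent value, the following Python sketch splits the field into product tokens and comments. The function name parse_products and the tuple layout are hypothetical; nested comments and values that violate the product token syntax are not handled.

```python
import re

# product = token ["/" product-version]; comment = "(" text ")"
_PRODUCT_OR_COMMENT = re.compile(r"\(([^()]*)\)|([^\s()/]+)(?:/([^\s()]+))?")


def parse_products(field_value):
    """Split a Server or User-Agent value into products and comments.

    Returns ("product", name, version) and ("comment", text) tuples in
    the order they appear; version is None when no "/" part is given.
    """
    items = []
    for match in _PRODUCT_OR_COMMENT.finditer(field_value):
        comment, name, version = match.groups()
        if comment is not None:
            items.append(("comment", comment))
        else:
            items.append(("product", name, version))
    return items
```

Because the tokens are listed in order of significance, the first "product" tuple is usually the one a server would use to recognize the application.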

10.16 WWW-Authenticate

The WWW-Authenticate response-header field must be included in 401 
(unauthorized) response messages. The field value consists of at 
least one challenge that indicates the authentication scheme(s) and
parameters applicable to the Request-URI.

WWW-Authenticate = "WWW-Authenticate" ":" 1#challenge

The HTTP access authentication process is described in Section 11. 
User agents must take special care in parsing the WWW-Authenticate 
field value if it contains more than one challenge, or if more than 
one WWW-Authenticate header field is provided, since the contents of 
a challenge may itself contain a comma-separated list of 
authentication parameters.

11. Access Authentication

HTTP provides a simple challenge-response authentication mechanism 
which may be used by a server to challenge a client request and by a 
client to provide authentication information. It uses an extensible, 
case-insensitive token to identify the authentication scheme, 
followed by a comma-separated list of attribute-value pairs which 
carry the parameters necessary for achieving authentication via that 
scheme. 

auth-scheme = token 

auth-param = token "=" quoted-string 

The 401 (unauthorized) response message is used by an origin server 
to challenge the authorization of a user agent. This response must 
include a WWW-Authenticate header field containing at least one
challenge applicable to the requested resource. 

challenge      = auth-scheme 1*SP realm *( "," auth-param )

realm          = "realm" "=" realm-value

realm-value = quoted-string 

The realm attribute (case-insensitive) is required for all
authentication schemes which issue a challenge. The realm value 
(case-sensitive), in combination with the canonical root URL of the 
server being accessed, defines the protection space. These realms 
allow the protected resources on a server to be partitioned into a 
set of protection spaces, each with its own authentication scheme 
and/or authorization database. The realm value is a string, generally 
assigned by the origin server, which may have additional semantics 
specific to the authentication scheme. 
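
A minimal sketch of parsing a single challenge according to the grammar above is shown here in Python. The name parse_challenge is hypothetical, and the sketch assumes the input holds exactly one challenge; splitting a field value that carries several challenges needs extra care and is not attempted.

```python
import re

# auth-param = token "=" quoted-string (no escapes inside the quotes)
_AUTH_PARAM = re.compile(r'([^\s=,"]+)\s*=\s*"([^"]*)"')


def parse_challenge(challenge):
    """Parse one challenge into (scheme, params).

    The scheme and the attribute names are lowercased because they are
    case-insensitive; attribute values such as the realm are kept as-is.
    """
    scheme, _, rest = challenge.strip().partition(" ")
    params = {name.lower(): value for name, value in _AUTH_PARAM.findall(rest)}
    if "realm" not in params:
        raise ValueError("challenge does not carry the required realm")
    return scheme.lower(), params
```

Quoted values may contain commas, which is why the sketch matches whole attribute-value pairs rather than splitting the text on ",".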

A user agent that wishes to authenticate itself with a server — 
usually, but not necessarily, after receiving a 401 response — may do 
so by including an Authorization header field with the request. The 
Authorization field value consists of credentials containing the 
authentication information of the user agent for the realm of the 
resource being requested. 

credentials    = basic-credentials
               | ( auth-scheme #auth-param )

The domain over which credentials can be automatically applied by a 
user agent is determined by the protection space. If a prior request 
has been authorized, the same credentials may be reused for all other 
requests within that protection space for a period of time determined
by the authentication scheme, parameters, and/or user preference.
Unless otherwise defined by the authentication scheme, a single 
protection space cannot extend outside the scope of its server. 

If the server does not wish to accept the credentials sent with a 
request, it should return a 403 (forbidden) response. 

The HTTP protocol does not restrict applications to this simple 
challenge-response mechanism for access authentication. Additional 
mechanisms may be used, such as encryption at the transport level or 
via message encapsulation, and with additional header fields 
specifying authentication information. However, these additional 
mechanisms are not defined by this specification. 

Proxies must be completely transparent regarding user agent
authentication. That is, they must forward the WWW-Authenticate and
Authorization headers untouched, and must not cache the response to a 
request containing Authorization. HTTP/1.0 does not provide a means 
for a client to be authenticated with a proxy. 

11.1 Basic Authentication Scheme 

The "basic" authentication scheme is based on the model that the user 
agent must authenticate itself with a user-ID and a password for each 
realm. The realm value should be considered an opaque string which 
can only be compared for equality with other realms on that server. 
The server will authorize the request only if it can validate the 
user-ID and password for the protection space of the Request-URI.
There are no optional authentication parameters. 

Upon receipt of an unauthorized request for a URI within the 
protection space, the server should respond with a challenge like the 
following: 

WWW-Authenticate: Basic realm="WallyWorld"

where "WallyWorld" is the string assigned by the server to identify
the protection space of the Request-URI.

To receive authorization, the client sends the user-ID and password, 
separated by a single colon (":") character, within a base64 [5] 
encoded string in the credentials. 

basic-credentials = "Basic" SP basic-cookie 

basic-cookie      = <base64 [5] encoding of userid-password,
                     except not limited to 76 char/line>

userid-password   = [ token ] ":" *TEXT

If the user agent wishes to send the user-ID "Aladdin" and password 
"open sesame", it would use the following header field: 

Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ== 

The basic authentication scheme is a non-secure method of filtering 
unauthorized access to resources on an HTTP server. It is based on 
the assumption that the connection between the client and the server 
can be regarded as a trusted carrier. As this is not generally true 
on an open network, the basic authentication scheme should be used 
accordingly. In spite of this, clients should implement the scheme in 
order to communicate with servers that use it. 
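
The construction of the basic-cookie can be sketched as follows in Python. The function names are hypothetical, and the choice of ISO-8859-1 for turning the user-ID and password into octets is an assumption made for this example, not something the text above prescribes.

```python
import base64


def basic_authorization(user_id, password):
    """Build the Authorization header line for the basic scheme."""
    if ":" in user_id:
        # The first colon separates the two parts, so the user-ID
        # itself cannot contain one.
        raise ValueError("user-ID must not contain a colon")
    raw = (user_id + ":" + password).encode("iso-8859-1")  # assumed charset
    # b64encode emits one unbroken line, so no 76-character wrapping occurs.
    cookie = base64.b64encode(raw).decode("ascii")
    return "Authorization: Basic " + cookie


def decode_basic_cookie(cookie):
    """Recover (user_id, password) from a basic-cookie."""
    text = base64.b64decode(cookie).decode("iso-8859-1")
    user_id, _, password = text.partition(":")
    return user_id, password
```

Decoding is as easy as encoding, which illustrates why the scheme offers no confidentiality on its own.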

12. Security Considerations 

This section is meant to inform application developers, information 
providers, and users of the security limitations in HTTP/1.0 as 
described by this document. The discussion does not include 
definitive solutions to the problems revealed, though it does make 
some suggestions for reducing security risks. 

12.1 Authentication of Clients 

As mentioned in Section 11.1, the Basic authentication scheme is not 
a secure method of user authentication, nor does it prevent the 
Entity-Body from being transmitted in clear text across the physical 
network used as the carrier. HTTP/1.0 does not prevent additional 
authentication schemes and encryption mechanisms from being employed 
to increase security. 

12.2 Safe Methods 

The writers of client software should be aware that the software 
represents the user in their interactions over the Internet, and 
should be careful to allow the user to be aware of any actions they 
may take which may have an unexpected significance to themselves or 
others. 

In particular, the convention has been established that the GET and 
HEAD methods should never have the significance of taking an action 
other than retrieval. These methods should be considered "safe." This 
allows user agents to represent other methods, such as POST, in a 
special way, so that the user is made aware of the fact that a 
possibly unsafe action is being requested.

Naturally, it is not possible to ensure that the server does not
generate side-effects as a result of performing a GET request; in 
fact, some dynamic resources consider that a feature. The important 
distinction here is that the user did not request the side-effects, 
so therefore cannot be held accountable for them. 
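
A user agent could encode this convention with a check like the Python sketch below. The names SAFE_METHODS and needs_user_confirmation are hypothetical, and how the warning is presented to the user is left open.

```python
# Methods that, by convention, only retrieve information.
SAFE_METHODS = frozenset({"GET", "HEAD"})


def needs_user_confirmation(method):
    """Return True when the method may do more than retrieval."""
    # Method names are compared exactly; no case folding is applied.
    return method not in SAFE_METHODS
```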

12.3 Abuse of Server Log Information 

A server is in the position to save personal data about a user's 
requests which may identify their reading patterns or subjects of 
interest. This information is clearly confidential in nature and its 
handling may be constrained by law in certain countries. People using 
the HTTP protocol to provide data are responsible for ensuring that 
such material is not distributed without the permission of any 
individuals that are identifiable by the published results. 

12.4 Transfer of Sensitive Information 

Like any generic data transfer protocol, HTTP cannot regulate the 
content of the data that is transferred, nor is there any a priori 
method of determining the sensitivity of any particular piece of 
information within the context of any given request. Therefore, 
applications should supply as much control over this information as 
possible to the provider of that information. Three header fields are 
worth special mention in this context: Server, Referer and From. 

Revealing the specific software version of the server may allow the 
server machine to become more vulnerable to attacks against software 
that is known to contain security holes. Implementors should make the 
Server header field a configurable option. 

The Referer field allows reading patterns to be studied and reverse 
links drawn. Although it can be very useful, its power can be abused 
if user details are not separated from the information contained in 
the Referer. Even when the personal information has been removed, the 
Referer field may indicate a private document's URI whose publication 
would be inappropriate. 

The information sent in the From field might conflict with the user's 
privacy interests or their site's security policy, and hence it 
should not be transmitted without the user being able to disable, 
enable, and modify the contents of the field. The user must be able 
to set the contents of this field within a user preference or 
application defaults configuration. 

We suggest, though do not require, that a convenient toggle interface 
be provided for the user to enable or disable the sending of From and 
Referer information.

12.5 Attacks Based On File and Path Names

Implementations of HTTP origin servers should be careful to restrict 
the documents returned by HTTP requests to be only those that were 
intended by the server administrators. If an HTTP server translates 
HTTP URIs directly into file system calls, the server must take 
special care not to serve files that were not intended to be 
delivered to HTTP clients. For example, Unix, Microsoft Windows, and
other operating systems use ".." as a path component to indicate a
directory level above the current one. On such a system, an HTTP
server must disallow any such construct in the Request-URI if it 
would otherwise allow access to a resource outside those intended to 
be accessible via the HTTP server. Similarly, files intended for 
reference only internally to the server (such as access control 
files, configuration files, and script code) must be protected from 
inappropriate retrieval, since they might contain sensitive 
information. Experience has shown that minor bugs in such HTTP server 
implementations have turned into security risks. 
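
One conservative way to apply the rule about parent-directory components is sketched below in Python. The function name resolve_request_path and the document_root parameter are hypothetical. The sketch simply refuses any path that contains ".." after unescaping; it does not address symbolic links or the protection of internal files.

```python
import posixpath
from urllib.parse import unquote


def resolve_request_path(document_root, request_path):
    """Map the path of a Request-URI to a file path under document_root.

    Returns None when the path contains a ".." component, so the caller
    can answer with an error instead of touching the file system.
    """
    decoded = unquote(request_path)  # "%2e%2e" must be caught as well
    # Treat backslashes as separators too, then drop empty and "." parts.
    parts = [p for p in decoded.replace("\\", "/").split("/")
             if p not in ("", ".")]
    if ".." in parts:
        return None
    return posixpath.join(document_root, *parts)
```

Rejecting the construct outright is stricter than resolving it, but it avoids subtle mistakes in path normalization.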

Appendices

These appendices are provided for informational reasons only; they
do not form a part of the HTTP/1.0 specification.

A. Internet Media Type message/http

In addition to defining the HTTP/1.0 protocol, this document serves
as the specification for the Internet media type "message/http". The 
following is to be registered with IANA [13]. 



Media Type name:         message
Media subtype name:      http
Required parameters:     none
Optional parameters:     version, msgtype

       version: The HTTP-Version number of the enclosed message
                (e.g., "1.0"). If not present, the version can be
                determined from the first line of the body.

       msgtype: The message type, either "request" or "response". If
                not present, the type can be determined from the
                first line of the body.

Encoding considerations: only "7bit", "8bit", or "binary" are
                         permitted

Security considerations: none

B. Tolerant Applications 

Although this document specifies the requirements for the generation 
of HTTP/1.0 messages, not all applications will be correct in their 
implementation. We therefore recommend that operational applications 
be tolerant of deviations whenever those deviations can be 
interpreted unambiguously. 

Clients should be tolerant in parsing the Status-Line and servers 
tolerant when parsing the Request-Line. In particular, they should 
accept any amount of SP or HT characters between fields, even though 
only a single SP is required. 
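
A tolerant Request-Line parser along these lines might look like the following Python sketch. The name parse_request_line is hypothetical, and treating a two-field line as a request without a version is an assumption made for this example.

```python
def parse_request_line(line):
    """Split a Request-Line into (method, request_uri, http_version).

    Any run of SP or HT characters between fields is accepted, and a
    trailing line terminator is ignored. http_version is None when the
    line carries only a method and a URI.
    """
    fields = line.split()  # splits on any run of whitespace
    if len(fields) == 3:
        method, request_uri, http_version = fields
    elif len(fields) == 2:
        method, request_uri = fields
        http_version = None
    else:
        raise ValueError("cannot interpret Request-Line unambiguously")
    return method, request_uri, http_version
```

Lines that cannot be interpreted unambiguously are rejected rather than guessed at.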

The line terminator for HTTP-header fields is the sequence CRLF. 
However, we recommend that applications, when parsing such headers, 
recognize a single LF as a line terminator and ignore the leading CR.

C. Relationship to MIME

HTTP/1.0 uses many of the constructs defined for Internet Mail (RFC 
822 [7]) and the Multipurpose Internet Mail Extensions (MIME [5]) to 
allow entities to be transmitted in an open variety of 
representations and with extensible mechanisms. However, RFC 1521 
discusses mail, and HTTP has a few features that are different than 
those described in RFC 1521. These differences were carefully chosen 
to optimize performance over binary connections, to allow greater 
freedom in the use of new media types, to make date comparisons 
easier, and to acknowledge the practice of some early HTTP servers 
and clients.

At the time of this writing, it is expected that RFC 1521 will be 
revised. The revisions may include some of the practices found in 
HTTP/1.0 but not in RFC 1521.

This appendix describes specific areas where HTTP differs from RFC 
1521. Proxies and gateways to strict MIME environments should be 
aware of these differences and provide the appropriate conversions 
where necessary. Proxies and gateways from MIME environments to HTTP 
also need to be aware of the differences because some conversions may 
be required. 

C.1 Conversion to Canonical Form

RFC 1521 requires that an Internet mail entity be converted to 
canonical form prior to being transferred, as described in Appendix G 
of RFC 1521 [5]. Section 3.6.1 of this document describes the forms 
allowed for subtypes of the "text" media type when transmitted over 
HTTP. 

RFC 1521 requires that content with a Content-Type of "text" 
represent line breaks as CRLF and forbids the use of CR or LF outside 
of line break sequences. HTTP allows CRLF, bare CR, and bare LF to 
indicate a line break within text content when a message is
transmitted over HTTP. 
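
A gateway that needs the RFC 1521 canonical form could normalize line breaks roughly as in the Python sketch below. The name to_canonical_crlf is hypothetical, and the sketch assumes a body with no content coding applied, in a character set where octets 13 and 10 really are CR and LF.

```python
import re

# CRLF is listed first so that it is never split into CR plus LF.
_LINE_BREAK = re.compile(rb"\r\n|\r|\n")


def to_canonical_crlf(text_body):
    """Rewrite every CRLF, bare CR, or bare LF in a bytes body as CRLF."""
    return _LINE_BREAK.sub(b"\r\n", text_body)
```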

Where it is possible, a proxy or gateway from HTTP to a strict RFC 
1521 environment should translate all line breaks within the text 
media types described in Section 3.6.1 of this document to the RFC 
1521 canonical form of CRLF. Note, however, that this may be 
complicated by the presence of a Content-Encoding and by the fact
that HTTP allows the use of some character sets which do not use 
octets 13 and 10 to represent CR and LF, as is the case for some 
multi-byte character sets.

C.2 Conversion of Date Formats

HTTP/1.0 uses a restricted set of date formats (Section 3.3) to 
simplify the process of date comparison. Proxies and gateways from 
other protocols should ensure that any Date header field present in a 
message conforms to one of the HTTP/1.0 formats and rewrite the date
if necessary. 
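
The rewrite step could be sketched as follows in Python. The function name rewrite_http_date is hypothetical. The three format strings reflect the date formats of Section 3.3 (RFC 822 as updated by RFC 1123, RFC 850, and ANSI C asctime) as understood for this example; the output is always the fixed-length RFC 1123 form.

```python
import calendar
import time
from email.utils import formatdate

_HTTP_DATE_FORMATS = (
    "%a, %d %b %Y %H:%M:%S GMT",  # RFC 822, updated by RFC 1123
    "%A, %d-%b-%y %H:%M:%S GMT",  # RFC 850
    "%a %b %d %H:%M:%S %Y",       # ANSI C asctime()
)


def rewrite_http_date(value):
    """Return value rewritten in the RFC 1123 date format."""
    for fmt in _HTTP_DATE_FORMATS:
        try:
            parsed = time.strptime(value.strip(), fmt)
        except ValueError:
            continue  # try the next accepted format
        # All HTTP dates are GMT, so convert without a local offset.
        return formatdate(calendar.timegm(parsed), usegmt=True)
    raise ValueError("not one of the accepted date formats: %r" % value)
```

A gateway from another protocol would need additional input formats in front of this step; those are outside the scope of the sketch.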

C.3 Introduction of Content-Encoding

RFC 1521 does not include any concept equivalent to HTTP/1.0's
Content-Encoding header field. Since this acts as a modifier on the
media type, proxies and gateways from HTTP to MIME-compliant
protocols must either change the value of the Content-Type header 
field or decode the Entity-Body before forwarding the message. (Some 
experimental applications of Content-Type for Internet mail have used 
a media-type parameter of ";conversions=<content-coding>" to perform
an equivalent function as Content-Encoding. However, this parameter 
is not part of RFC 1521.) 

C.4 No Content-Transfer-Encoding 

HTTP does not use the Content-Transfer-Encoding (CTE) field of RFC 
1521. Proxies and gateways from MIME-compliant protocols to HTTP must
remove any non-identity CTE ("quoted-printable" or "base64") encoding 
prior to delivering the response message to an HTTP client. 
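
Removing a non-identity CTE can be sketched with the Python standard library as shown here. The function name remove_transfer_encoding is hypothetical, and the sketch handles only the encodings named above plus the identity values "7bit", "8bit", and "binary".

```python
import base64
import quopri


def remove_transfer_encoding(body, cte):
    """Return the body (bytes) with its Content-Transfer-Encoding undone."""
    cte = cte.strip().lower()
    if cte == "base64":
        return base64.b64decode(body)
    if cte == "quoted-printable":
        return quopri.decodestring(body)
    if cte in ("7bit", "8bit", "binary"):
        return body  # identity encodings need no conversion
    raise ValueError("unknown Content-Transfer-Encoding: %r" % cte)
```

After decoding, the gateway would also drop the Content-Transfer-Encoding field itself before forwarding the message.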

Proxies and gateways from HTTP to MIME-compliant protocols are
responsible for ensuring that the message is in the correct format
and encoding for safe transport on that protocol, where "safe
transport" is defined by the limitations of the protocol being used.
Such a proxy or gateway should label the data with an appropriate 
Content-Transfer-Encoding if doing so will improve the likelihood of 
safe transport over the destination protocol. 

C.5 HTTP Header Fields in Multipart Body-Parts

In RFC 1521, most header fields in multipart body-parts are generally 
ignored unless the field name begins with "Content-". In HTTP/1.0, 
multipart body-parts may contain any HTTP header fields which are 
significant to the meaning of that part. 

D. Additional Features 

This appendix documents protocol elements used by some existing HTTP 
implementations, but not consistently and correctly across most 
HTTP/1.0 applications. Implementors should be aware of these 
features, but cannot rely upon their presence in, or interoperability
with, other HTTP/1.0 applications.

D.1 Additional Request Methods

D.1.1 PUT

The PUT method requests that the enclosed entity be stored under the
supplied Request-URI. If the Request-URI refers to an already
existing resource, the enclosed entity should be considered as a 
modified version of the one residing on the origin server. If the 
Request-URI does not point to an existing resource, and that URI is 
capable of being defined as a new resource by the requesting user 
agent, the origin server can create the resource with that URI. 

The fundamental difference between the POST and PUT requests is 
reflected in the different meaning of the Request-URI. The URI in a 
POST request identifies the resource that will handle the enclosed 
entity as data to be processed. That resource may be a data-accepting 
process, a gateway to some other protocol, or a separate entity that 
accepts annotations. In contrast, the URI in a PUT request identifies 
the entity enclosed with the request — the user agent knows what URI 
is intended and the server should not apply the request to some other 
resource. 

D.1.2 DELETE

The DELETE method requests that the origin server delete the resource 
identified by the Request-URI. 

D.1.3 LINK 

The LINK method establishes one or more Link relationships between 
the existing resource identified by the Request-URI and other 
existing resources. 

D.1.4 UNLINK

The UNLINK method removes one or more Link relationships from the 
existing resource identified by the Request-URI. 

D.2 Additional Header Field Definitions

D.2.1 Accept

The Accept request-header field can be used to indicate a list of
media ranges which are acceptable as a response to the request. The 
asterisk character is used to group media types into ranges, with
"*/*" indicating all media types and "type/*" indicating all subtypes
of that type. The set of ranges given by the client should represent
what types are acceptable given the context of the request. 
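
Matching a media type against such ranges could be sketched as below in Python. The function names are hypothetical. The sketch assumes the field value is a comma-separated list of ranges, each optionally followed by ";"-delimited parameters that are ignored; this appendix does not give the field's grammar.

```python
def media_range_matches(media_range, media_type):
    """Return True when media_type falls inside media_range."""
    range_type, _, range_subtype = media_range.strip().lower().partition("/")
    type_, _, subtype = media_type.strip().lower().partition("/")
    if range_type == "*" and range_subtype == "*":
        return True  # "*/*" covers all media types
    if range_type != type_:
        return False
    return range_subtype in ("*", subtype)  # "type/*" covers all subtypes


def is_acceptable(accept_value, media_type):
    """Check media_type against every range in an Accept field value."""
    ranges = [item.split(";")[0] for item in accept_value.split(",")
              if item.strip()]
    return any(media_range_matches(r, media_type) for r in ranges)
```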

D.2.2 Accept-Charset

The Accept-Charset request-header field can be used to indicate a
list of preferred character sets other than the default US-ASCII and
ISO-8859-1. This field allows clients capable of understanding more
comprehensive or special-purpose character sets to signal that
capability to a server which is capable of representing documents in 
those character sets. 

D.2.3 Accept-Encoding

The Accept-Encoding request-header field is similar to Accept, but
restricts the content-coding values which are acceptable in the
response. 

D.2.4 Accept-Language

The Accept-Language request-header field is similar to Accept, but
restricts the set of natural languages that are preferred as a 
response to the request. 

D.2.5 Content-Language

The Content-Language entity-header field describes the natural 
language(s) of the intended audience for the enclosed entity. Note 
that this may not be equivalent to all the languages used within the 
entity. 

D.2.6 Link 

The Link entity-header field provides a means for describing a 
relationship between the entity and some other resource. An entity 
may include multiple Link values. Links at the metainformation level
typically indicate relationships like hierarchical structure and 
navigation paths. 

D.2.7 MIME-Version 

HTTP messages may include a single MIME-Version general-header field
to indicate what version of the MIME protocol was used to construct 
the message. Use of the MIME-Version header field, as defined by RFC 
1521 [5], should indicate that the message is MIME-conformant. 
Unfortunately, some older HTTP/1.0 servers send it indiscriminately, 
and thus this field should be ignored.

D.2.8 Retry-After

The Retry-After response-header field can be used with a 503 (service 
unavailable) response to indicate how long the service is expected to 
be unavailable to the requesting client. The value of this field can 
be either an HTTP-date or an integer number of seconds (in decimal) 
after the time of the response. 
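
A client could turn either form of the value into a delay with a sketch like the following Python. The name retry_after_seconds is hypothetical, and response_time is assumed to be a timezone-aware datetime for the moment the response was received.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, response_time):
    """Return the number of seconds to wait, given a Retry-After value."""
    value = value.strip()
    if value.isdigit():
        return int(value)  # already a decimal number of seconds
    retry_at = parsedate_to_datetime(value)  # HTTP-date form
    # Never report a negative delay if the date is already in the past.
    return max(0, int((retry_at - response_time).total_seconds()))
```

For example, with response_time set to 08:49:37 GMT on 6 November 1994, the value "Sun, 06 Nov 1994 08:51:37 GMT" and the value "120" describe the same delay.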

D.2.9 Title 

The Title entity-header field indicates the title of the entity.

D.2.10 URI

The URI entity-header field may contain some or all of the Uniform 
Resource Identifiers (Section 3.2) by which the Request-URI resource 
can be identified. There is no guarantee that the resource can be 
accessed using the URI(s) specified.

Hypertext Transfer Protocol -- HTTP/1.1



Status of this Memo 

This document specifies an Internet standards track protocol for the 
Internet community, and requests discussion and suggestions for 
improvements. Please refer to the current edition of the "Internet 
Official Protocol Standards" (STD 1) for the standardization state
and status of this protocol. Distribution of this memo is unlimited. 



The Hypertext Transfer Protocol (HTTP) is an application-level
protocol for distributed, collaborative, hypermedia information 
systems. It is a generic, stateless, object-oriented protocol which 
can be used for many tasks, such as name servers and distributed 
object management systems, through extension of its request methods. 
A feature of HTTP is the typing and negotiation of data 
representation, allowing systems to be built independently of the 
data being transferred. 

HTTP has been in use by the World-Wide Web global information 
initiative since 1990. This specification defines the protocol 
referred to as "HTTP/1.1".

1 Introduction

1.1 Purpose

The Hypertext Transfer Protocol (HTTP) is an application-level 
protocol for distributed, collaborative, hypermedia information 
systems. HTTP has been in use by the World-Wide Web global
information initiative since 1990. The first version of HTTP, 
referred to as HTTP/0.9, was a simple protocol for raw data transfer 
across the Internet. HTTP/1.0, as defined by RFC 1945 [6], improved 
the protocol by allowing messages to be in the format of MIME-like 
messages, containing metainformation about the data transferred and 
modifiers on the request/response semantics. However, HTTP/1.0 does
not sufficiently take into consideration the effects of hierarchical 
proxies, caching, the need for persistent connections, and virtual 
hosts. In addition, the proliferation of incompletely-implemented 
applications calling themselves "HTTP/1.0" has necessitated a
protocol version change in order for two communicating applications 
to determine each other's true capabilities. 

This specification defines the protocol referred to as "HTTP/1.1".
This protocol includes more stringent requirements than HTTP/1.0 in 
order to ensure reliable implementation of its features. 

Practical information systems require more functionality than simple 
retrieval, including search, front-end update, and annotation. HTTP 
allows an open-ended set of methods that indicate the purpose of a 
request. It builds on the discipline of reference provided by the 
Uniform Resource Identifier (URI) [3][20], as a location (URL) [4] or 
name (URN), for indicating the resource to which a method is to be
applied. Messages are passed in a format similar to that used by
Internet mail as defined by the Multipurpose Internet Mail Extensions
(MIME).

HTTP is also used as a generic protocol for communication between 
user agents and proxies/gateways to other Internet systems, including 
those supported by the SMTP [16], NNTP [13], FTP [18], Gopher [2], 
and WAIS [10] protocols. In this way, HTTP allows basic hypermedia 
access to resources available from diverse applications. 

1.2 Requirements 

This specification uses the same words as RFC 1123 [8] for defining 
the significance of each particular requirement. These words are: 

MUST 

This word or the adjective "required" means that the item is an 
absolute requirement of the specification.

SHOULD

This word or the adjective "recommended" means that there may 
exist valid reasons in particular circumstances to ignore this 
item, but the full implications should be understood and the case 
carefully weighed before choosing a different course. 

MAY 

This word or the adjective "optional" means that this item is 
truly optional. One vendor may choose to include the item because 
a particular marketplace requires it or because it enhances the 
product, for example; another vendor may omit the same item. 

An implementation is not compliant if it fails to satisfy one or more 
of the MUST requirements for the protocols it implements. An 
implementation that satisfies all the MUST and all the SHOULD 
requirements for its protocols is said to be "unconditionally 
compliant"; one that satisfies all the MUST requirements but not all 
the SHOULD requirements for its protocols is said to be 
"conditionally compliant." 

1.3 Terminology

This specification uses a number of terms to refer to the roles 
played by participants in, and objects of, the HTTP communication. 

connection 

A transport layer virtual circuit established between two programs 
for the purpose of communication. 

message 

The basic unit of HTTP communication, consisting of a structured 
sequence of octets matching the syntax defined in section 4 and 
transmitted via the connection. 

request 

An HTTP request message, as defined in section 5.

response

An HTTP response message, as defined in section 6.

resource

A network data object or service that can be identified by a URI, 
as defined in section 3.2. Resources may be available in multiple 
representations (e.g. multiple languages, data formats, size, 
resolutions) or vary in other ways.

entity

The information transferred as the payload of a request or 
response. An entity consists of metainformation in the form of
entity-header fields and content in the form of an entity-body, as 
described in section 7. 

representation 

An entity included with a response that is subject to content 
negotiation, as described in section 12. There may exist multiple 
representations associated with a particular response status. 

content negotiation 

The mechanism for selecting the appropriate representation when 
servicing a request, as described in section 12. The 
representation of entities in any response can be negotiated 
(including error responses). 

variant 

A resource may have one, or more than one, representation(s) 
associated with it at any given instant. Each of these 
representations is termed a 'variant.' Use of the term 'variant'
does not necessarily imply that the resource is subject to content 
negotiation. 

client

A program that establishes connections for the purpose of sending 
requests. 

user agent 

The client which initiates a request. These are often browsers, 
editors, spiders (web-traversing robots), or other end user tools.

server 

An application program that accepts connections in order to 
service requests by sending back responses. Any given program may 
be capable of being both a client and a server; our use of these 
terms refers only to the role being performed by the program for a 
particular connection, rather than to the program's capabilities 
in general. Likewise, any server may act as an origin server, 
proxy, gateway, or tunnel, switching behavior based on the nature 
of each request. 

origin server 

The server on which a given resource resides or is to be created.

proxy

An intermediary program which acts as both a server and a client 
for the purpose of making requests on behalf of other clients. 
Requests are serviced internally or by passing them on, with 
possible translation, to other servers. A proxy must implement 
both the client and server requirements of this specification. 

gateway 

A server which acts as an intermediary for some other server. 
Unlike a proxy, a gateway receives requests as if it were the 
origin server for the requested resource; the requesting client 
may not be aware that it is communicating with a gateway. 

tunnel 

An intermediary program which is acting as a blind relay between 
two connections. Once active, a tunnel is not considered a party 
to the HTTP communication, though the tunnel may have been 
initiated by an HTTP request. The tunnel ceases to exist when both 
ends of the relayed connections are closed. 

cache 

A program's local store of response messages and the subsystem
that controls its message storage, retrieval, and deletion. A 
cache stores cachable responses in order to reduce the response 
time and network bandwidth consumption on future, equivalent 
requests. Any client or server may include a cache, though a cache 
cannot be used by a server that is acting as a tunnel. 

cachable

A response is cachable if a cache is allowed to store a copy of 
the response message for use in answering subsequent requests. The 
rules for determining the cachability of HTTP responses are 
defined in section 13. Even if a resource is cachable, there may 
be additional constraints on whether a cache can use the cached 
copy for a particular request. 

first-hand

A response is first-hand if it comes directly and without 
unnecessary delay from the origin server, perhaps via one or more 
proxies. A response is also first-hand if its validity has just 
been checked directly with the origin server. 

explicit expiration time 

The time at which the origin server intends that an entity should 
no longer be returned by a cache without further validation.

heuristic expiration time

An expiration time assigned by a cache when no explicit expiration 
time is available.

age 

The age of a response is the time since it was sent by, or 
successfully validated with, the origin server. 

freshness lifetime 

The length of time between the generation of a response and its 
expiration time. 

fresh 

A response is fresh if its age has not yet exceeded its freshness 
lifetime. 

stale 

A response is stale if its age has passed its freshness lifetime. 
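
The definitions of age, freshness lifetime, fresh, and stale above reduce
to a single comparison. The following is a minimal sketch; the function
and parameter names are illustrative assumptions, not part of the
specification, and both values are assumed to be whole seconds.

```python
def is_fresh(age_seconds: int, freshness_lifetime_seconds: int) -> bool:
    """A response is fresh while its age has not exceeded its freshness lifetime."""
    return age_seconds <= freshness_lifetime_seconds


def is_stale(age_seconds: int, freshness_lifetime_seconds: int) -> bool:
    """A response is stale once its age has passed its freshness lifetime."""
    return not is_fresh(age_seconds, freshness_lifetime_seconds)
```

A response whose age exactly equals its freshness lifetime is treated
here as still fresh, since its age has not yet exceeded the lifetime.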

semantically transparent

A cache behaves in a "semantically transparent" manner, with
respect to a particular response, when its use affects neither the
requesting client nor the origin server, except to improve
performance. When a cache is semantically transparent, the client
receives exactly the same response (except for hop-by-hop headers) 
that it would have received had its request been handled directly 
by the origin server. 

validator 

A protocol element (e.g., an entity tag or a Last-Modified time) 
that is used to find out whether a cache entry is an equivalent 
copy of an entity. 

1.4 Overall Operation 

The HTTP protocol is a request/response protocol. A client sends a 
request to the server in the form of a request method, URI, and 
protocol version, followed by a MIME-like message containing request 
modifiers, client information, and possible body content over a 
connection with a server. The server responds with a status line, 
including the message's protocol version and a success or error code, 
followed by a MIME-like message containing server information, entity 
metainformation, and possible entity-body content. The relationship 
between HTTP and MIME is described in appendix 19.4.

Most HTTP communication is initiated by a user agent and consists of
a request to be applied to a resource on some origin server. In the
simplest case, this may be accomplished via a single connection (v)
between the user agent (UA) and the origin server (O).

   request chain ------------------------>
UA -------------------v------------------- O
   <----------------------- response chain

A more complicated situation occurs when one or more intermediaries 
are present in the request/response chain. There are three common
forms of intermediary: proxy, gateway, and tunnel. A proxy is a 
forwarding agent, receiving requests for a URI in its absolute form, 
rewriting all or part of the message, and forwarding the reformatted 
request toward the server identified by the URI. A gateway is a 
receiving agent, acting as a layer above some other server(s) and, if
necessary, translating the requests to the underlying server's 
protocol. A tunnel acts as a relay point between two connections 
without changing the messages; tunnels are used when the 
communication needs to pass through an intermediary (such as a 
firewall) even when the intermediary cannot understand the contents 
of the messages. 

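   request chain -------------------------------------->
UA -----v----- A -----v----- B -----v----- C -----v----- O
   <------------------------------------- response chain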

The figure above shows three intermediaries (A, B, and C) between the 
user agent and origin server. A request or response message that 
travels the whole chain will pass through four separate connections. 
This distinction is important because some HTTP communication options 
may apply only to the connection with the nearest, non-tunnel 
neighbor, only to the end-points of the chain, or to all connections 
along the chain. Although the diagram is linear, each participant 
may be engaged in multiple, simultaneous communications. For example, 
B may be receiving requests from many clients other than A, and/or 
forwarding requests to servers other than C, at the same time that it 
is handling A's request. 

Any party to the communication which is not acting as a tunnel may 
employ an internal cache for handling requests. The effect of a cache 
is that the request/response chain is shortened if one of the 
participants along the chain has a cached response applicable to that 
request. The following illustrates the resulting chain if B has a 
cached copy of an earlier response from O (via C) for a request which
has not been cached by UA or A.

   request chain ---------->
UA -----v----- A -----v----- B - - - - - - C - - - - - - O
   <--------- response chain

Not all responses are usefully cachable, and some requests may 
contain modifiers which place special requirements on cache behavior. 
HTTP requirements for cache behavior and cachable responses are 
defined in section 13. 

In fact, there are a wide variety of architectures and configurations 
of caches and proxies currently being experimented with or deployed 
across the World Wide Web; these systems include national hierarchies 
of proxy caches to save transoceanic bandwidth, systems that 
broadcast or multicast cache entries, organizations that distribute 
subsets of cached data via CD-ROM, and so on. HTTP systems are used 
in corporate intranets over high-bandwidth links, and for access via 
PDAs with low-power radio links and intermittent connectivity. The 
goal of HTTP/1.1 is to support the wide diversity of configurations 
already deployed while introducing protocol constructs that meet the 
needs of those who build web applications that require high 
reliability and, failing that, at least reliable indications of 
failure. 

HTTP communication usually takes place over TCP/IP connections. The 
default port is TCP 80, but other ports can be used. This does not 
preclude HTTP from being implemented on top of any other protocol on 
the Internet, or on other networks. HTTP only presumes a reliable 
transport; any protocol that provides such guarantees can be used; 
the mapping of the HTTP/1.1 request and response structures onto the 
transport data units of the protocol in question is outside the scope 
of this specification. 

In HTTP/1.0, most implementations used a new connection for each 
request/response exchange. In HTTP/1.1, a connection may be used for 
one or more request/response exchanges, although connections may be 
closed for a variety of reasons (see section 8.1). 

2 Notational Conventions and Generic Grammar 

2.1 Augmented BNF

All of the mechanisms specified in this document are described in 
both prose and an augmented Backus-Naur Form (BNF) similar to that 
used by RFC 822 [9]. Implementers will need to be familiar with the 
notation in order to understand this specification. The augmented BNF 
includes the following constructs:

name = definition

The name of a rule is simply the name itself (without any enclosing 
"<" and ">") and is separated from its definition by the equal "=" 
character. Whitespace is only significant in that indentation of 
continuation lines is used to indicate a rule definition that spans 
more than one line. Certain basic rules are in uppercase, such as 
SP, LWS, HT, CRLF, DIGIT, ALPHA, etc. Angle brackets are used 
within definitions whenever their presence will facilitate 
discerning the use of rule names.

"literal"

Quotation marks surround literal text. Unless stated otherwise, the
text is case-insensitive.

rule1 | rule2

Elements separated by a bar ("|") are alternatives, e.g., "yes |
no" will accept yes or no.

(rule1 rule2)

Elements enclosed in parentheses are treated as a single element.
Thus, "(elem (foo | bar) elem)" allows the token sequences "elem
foo elem" and "elem bar elem".

*rule

The character "*" preceding an element indicates repetition. The
full form is "<n>*<m>element" indicating at least <n> and at most
<m> occurrences of element. Default values are 0 and infinity so
that "*(element)" allows any number, including zero; "1*element"
requires at least one; and "1*2element" allows one or two.

[rule] 

Square brackets enclose optional elements; "[foo bar]" is
equivalent to "*1(foo bar)".

<n>rule

Specific repetition: "<n>(element)" is equivalent to
"<n>*<n>(element)"; that is, exactly <n> occurrences of (element).
Thus 2DIGIT is a 2-digit number, and 3ALPHA is a string of three 
alphabetic characters. 

#rule

A construct "#" is defined, similar to "*", for defining lists of
elements. The full form is "<n>#<m>element" indicating at least
<n> and at most <m> elements, each separated by one or more commas
(",") and optional linear whitespace (LWS). This makes the usual
form of lists very easy; a rule such as "( *LWS element *( *LWS ","
*LWS element ))" can be shown as "1#element". Wherever this
construct is used, null elements are allowed, but do not contribute
to the count of elements present. That is, "(element), , (element)"
is permitted, but counts as only two elements. Therefore, where
at least one element is required, at least one non-null element
must be present. Default values are 0 and infinity so that
"#element" allows any number, including zero; "1#element" requires
at least one; and "1#2element" allows one or two.
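
The "#" list construct can be illustrated with a short parser. This is
a sketch under stated assumptions: the name parse_hash_list and its
minimum/maximum parameters are hypothetical, and commas inside
quoted-strings are not handled.

```python
from typing import List, Optional

# Characters that may appear as linear whitespace around list elements.
LWS_CHARS = " \t\r\n"


def parse_hash_list(value: str, minimum: int = 0,
                    maximum: Optional[int] = None) -> List[str]:
    """Split a "<n>#<m>element" field value into its non-null elements."""
    elements = [part.strip(LWS_CHARS) for part in value.split(",")]
    # Null elements are permitted but do not contribute to the count.
    elements = [element for element in elements if element]
    if len(elements) < minimum:
        raise ValueError("too few non-null elements")
    if maximum is not None and len(elements) > maximum:
        raise ValueError("too many elements")
    return elements
```

With this sketch, the value "a, , b" yields two elements, matching the
counting rule described above.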

; comment 

A semi-colon, set off some distance to the right of rule text, 
starts a comment that continues to the end of line. This is a 
simple way of including useful notes in parallel with the 
specifications. 

implied *LWS 

The grammar described by this specification is word-based. Except 
where noted otherwise, linear whitespace (LWS) can be included 
between any two adjacent words (token or quoted-string), and
between adjacent tokens and delimiters (tspecials), without 
changing the interpretation of a field. At least one delimiter 
(tspecials) must exist between any two tokens, since they would 
otherwise be interpreted as a single token. 

2.2 Basic Rules 

The following rules are used throughout this specification to 
describe basic parsing constructs. The US-ASCII coded character set
is defined by ANSI X3.4-1986 [21].



OCTET   = <any 8-bit sequence of data>
CHAR    = <any US-ASCII character (octets 0 - 127)>
UPALPHA = <any US-ASCII uppercase letter "A".."Z">
LOALPHA = <any US-ASCII lowercase letter "a".."z">
ALPHA   = UPALPHA | LOALPHA
DIGIT   = <any US-ASCII digit "0".."9">
CTL     = <any US-ASCII control character
          (octets 0 - 31) and DEL (127)>
CR      = <US-ASCII CR, carriage return (13)>
LF      = <US-ASCII LF, linefeed (10)>
SP      = <US-ASCII SP, space (32)>
HT      = <US-ASCII HT, horizontal-tab (9)>
<">     = <US-ASCII double-quote mark (34)>

HTTP/1.1 defines the sequence CR LF as the end-of-line marker for all
protocol elements except the entity-body (see appendix 19.3 for 
tolerant applications). The end-of-line marker within an entity-body 
is defined by its associated media type, as described in section 3.7. 

CRLF = CR LF 

HTTP/1.1 headers can be folded onto multiple lines if the 
continuation line begins with a space or horizontal tab. All linear 
white space, including folding, has the same semantics as SP. 

LWS = [CRLF] 1*( SP | HT )
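
Because all linear white space, including folding, has the same
semantics as SP, a recipient can unfold a header value by replacing each
LWS run with a single space. A minimal sketch follows; the function name
unfold is an assumption made for illustration.

```python
import re

# LWS = [CRLF] 1*( SP | HT ): an optional line break followed by
# at least one space or horizontal tab.
_LWS = re.compile(r"(?:\r\n)?[ \t]+")


def unfold(header_value: str) -> str:
    """Replace every run of linear white space with a single SP."""
    return _LWS.sub(" ", header_value)
```

A bare CRLF that is not followed by a space or tab is left untouched,
since it ends the header rather than continuing it.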

The TEXT rule is only used for descriptive field contents and values 
that are not intended to be interpreted by the message parser. Words 
of TEXT may contain characters from character sets other than ISO 
8859-1 [22] only when encoded according to the rules of RFC 1522 
[14]. 

TEXT = <any OCTET except CTLs,
       but including LWS>

Hexadecimal numeric characters are used in several protocol elements. 

HEX = "A" | "B" | "C" | "D" | "E" | "F"
    | "a" | "b" | "c" | "d" | "e" | "f" | DIGIT

Many HTTP/1.1 header field values consist of words separated by LWS 
or special characters. These special characters MUST be in a quoted 
string to be used within a parameter value. 

token     = 1*<any CHAR except CTLs or tspecials>

tspecials = "(" | ")" | "<" | ">" | "@"
          | "," | ";" | ":" | "\" | <">
          | "/" | "[" | "]" | "?" | "="
          | "{" | "}" | SP | HT
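
The token rule can be checked character by character. The sketch below
assumes the tspecials set as given in RFC 2068; the name is_token is
hypothetical.

```python
# tspecials per the grammar above, including SP and HT.
TSPECIALS = set('()<>@,;:\\"/[]?={} \t')


def is_token(text: str) -> bool:
    """Return True if text is 1*<any CHAR except CTLs or tspecials>."""
    if not text:
        return False  # a token needs at least one character
    for character in text:
        code = ord(character)
        if code > 127:
            return False  # outside CHAR (US-ASCII octets 0 - 127)
        if code <= 31 or code == 127:
            return False  # CTL
        if character in TSPECIALS:
            return False
    return True
```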

Comments can be included in some HTTP header fields by surrounding 
the comment text with parentheses. Comments are only allowed in 
fields containing "comment" as part of their field value definition. 
In all other fields, parentheses are considered part of the field 
value. 

comment = "(" *( ctext | comment ) ")"

ctext = <any TEXT excluding "(" and ")">

A string of text is parsed as a single word if it is quoted using
double-quote marks.

quoted-string = ( <"> *(qdtext) <"> )

qdtext = <any TEXT except <">>

The backslash character ("\") may be used as a single-character quoting
mechanism only within quoted-string and comment constructs.

quoted-pair = "\" CHAR

3 Protocol Parameters 

3.1 HTTP Version

HTTP uses a "<major>.<minor>" numbering scheme to indicate versions
of the protocol. The protocol versioning policy is intended to allow 
the sender to indicate the format of a message and its capacity for 
understanding further HTTP communication, rather than the features 
obtained via that communication. No change is made to the version 
number for the addition of message components which do not affect 
communication behavior or which only add to extensible field values. 
The <minor> number is incremented when the changes made to the 
protocol add features which do not change the general message parsing 
algorithm, but which may add to the message semantics and imply 
additional capabilities of the sender. The <major> number is 
incremented when the format of a message within the protocol is 
changed. 

The version of an HTTP message is indicated by an HTTP-Version field 
in the first line of the message. 

HTTP-Version = "HTTP" "/" 1*DIGIT "." 1*DIGIT

Note that the major and minor numbers MUST be treated as separate 
integers and that each may be incremented higher than a single digit. 
Thus, HTTP/2.4 is a lower version than HTTP/2.13, which in turn is 
lower than HTTP/12.3. Leading zeros MUST be ignored by recipients and 
MUST NOT be sent. 
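
Treating the major and minor numbers as separate integers is easy to get
wrong with string or floating-point comparison. The following sketch
parses the HTTP-Version field into a pair of integers, which then
compare in the required order; the name parse_http_version is an
assumption.

```python
import re
from typing import Tuple

# HTTP-Version = "HTTP" "/" 1*DIGIT "." 1*DIGIT
_HTTP_VERSION = re.compile(r"HTTP/([0-9]+)\.([0-9]+)")


def parse_http_version(field: str) -> Tuple[int, int]:
    """Return (major, minor) as integers; leading zeros are ignored."""
    match = _HTTP_VERSION.fullmatch(field)
    if match is None:
        raise ValueError("not an HTTP-Version field: %r" % field)
    return int(match.group(1)), int(match.group(2))
```

Tuples compare element by element, so HTTP/2.4 sorts below HTTP/2.13,
which in turn sorts below HTTP/12.3.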

Applications sending Request or Response messages, as defined by this 
specification, MUST include an HTTP-Version of "HTTP/1.1". Use of
this version number indicates that the sending application is at 
least conditionally compliant with this specification. 

The HTTP version of an application is the highest HTTP version for 
which the application is at least conditionally compliant. 



Proxy and gateway applications must be careful when forwarding 
messages in protocol versions different from that of the application. 
Since the protocol version indicates the protocol capability of the 
sender, a proxy/gateway MUST never send a message with a version 
indicator which is greater than its actual version; if a higher 
version request is received, the proxy/gateway MUST either downgrade 
the request version, respond with an error, or switch to tunnel 
behavior. Requests with a version lower than that of the 
proxy/gateway's version MAY be upgraded before being forwarded; the
proxy/gateway's response to that request MUST be in the same major
version as the request. 

Note: Converting between versions of HTTP may involve modification 
of header fields required or forbidden by the versions involved. 

3.2 Uniform Resource Identifiers 

URIs have been known by many names: WWW addresses, Universal Document 
Identifiers, Universal Resource Identifiers, and finally the
combination of Uniform Resource Locators (URL) and Names (URN). As 
far as HTTP is concerned, Uniform Resource Identifiers are simply 
formatted strings which identify — via name, location, or any other 
characteristic — a resource. 

3.2.1 General Syntax 

URIs in HTTP can be represented in absolute form or relative to some 
known base URI, depending upon the context of their use. The two 
forms are differentiated by the fact that absolute URIs always begin 
with a scheme name followed by a colon. 



URI         = ( absoluteURI | relativeURI ) [ "#" fragment ]

absoluteURI = scheme ":" *( uchar | reserved )

relativeURI = net_path | abs_path | rel_path

net_path    = "//" net_loc [ abs_path ]
abs_path    = "/" rel_path
rel_path    = [ path ] [ ";" params ] [ "?" query ]

path        = fsegment *( "/" segment )
fsegment    = 1*pchar
segment     = *pchar

params      = param *( ";" param )
param       = *( pchar | "/" )

scheme      = 1*( ALPHA | DIGIT | "+" | "-" | "." )
net_loc     = *( pchar | ";" | "?" )

query       = *( uchar | reserved )
fragment    = *( uchar | reserved )

pchar       = uchar | ":" | "@" | "&" | "=" | "+"
uchar       = unreserved | escape
unreserved  = ALPHA | DIGIT | safe | extra | national

escape      = "%" HEX HEX
reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+"
extra       = "!" | "*" | "'" | "(" | ")" | ","
safe        = "$" | "-" | "_" | "."
unsafe      = CTL | SP | <"> | "#" | "%" | "<" | ">"
national    = <any OCTET excluding ALPHA, DIGIT,
              reserved, extra, safe, and unsafe>

For definitive information on URL syntax and semantics, see RFC 1738 
[4] and RFC 1808 [11]. The BNF above includes national characters not 
allowed in valid URLs as specified by RFC 1738, since HTTP servers 
are not restricted in the set of unreserved characters allowed to 
represent the rel_path part of addresses, and HTTP proxies may 
receive requests for URIs not defined by RFC 1738. 

The HTTP protocol does not place any a priori limit on the length of 
a URI. Servers MUST be able to handle the URI of any resource they
serve, and SHOULD be able to handle URIs of unbounded length if they 
provide GET-based forms that could generate such URIs. A server 
SHOULD return 414 (Request-URI Too Long) status if a URI is longer 
than the server can handle (see section 10.4.15). 

Note: Servers should be cautious about depending on URI lengths 
above 255 bytes, because some older client or proxy implementations 
may not properly support these lengths. 

3.2.2 http URL 

The "http" scheme is used to locate network resources via the HTTP 
protocol. This section defines the scheme-specific syntax and
semantics for http URLs.

http_URL = "http:" "//" host [ ":" port ] [ abs_path ]

host     = <A legal Internet host domain name
           or IP address (in dotted-decimal form),
           as defined by Section 2.1 of RFC 1123>

port     = *DIGIT

If the port is empty or not given, port 80 is assumed. The semantics 
are that the identified resource is located at the server listening 
for TCP connections on that port of that host, and the Request-URI 
for the resource is abs_path. The use of IP addresses in URL's SHOULD
be avoided whenever possible (see RFC 1900 [24]). If the abs_path is
not present in the URL, it MUST be given as "/" when used as a
Request-URI for a resource (section 5.1.2). 

3.2.3 URI Comparison 

When comparing two URIs to decide if they match or not, a client 
SHOULD use a case-sensitive octet-by-octet comparison of the entire 
URIs, with these exceptions: 

o A port that is empty or not given is equivalent to the default 
port for that URI; 

o Comparisons of host names MUST be case-insensitive;

o Comparisons of scheme names MUST be case-insensitive; 

o An empty abs_path is equivalent to an abs_path of "/". 

Characters other than those in the "reserved" and "unsafe" sets (see 
section 3.2) are equivalent to their ""%" HEX HEX" encodings. 
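
The comparison rules above can be sketched as a normalization step
followed by an ordinary equality test. Everything named below
(normalize_http_url, the helper, and the constant names) is an
assumption for illustration. The sketch covers http URLs only, assumes a
default port of 80, ignores fragments, and normalizes the hex digits of
escapes that must stay encoded to upper case.

```python
import re
from typing import Tuple
from urllib.parse import urlsplit

RESERVED = set(";/?:@&=+")
UNSAFE = set('"#%<> ') | {chr(code) for code in range(32)} | {chr(127)}
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def _decode_equivalent(text: str) -> str:
    """Decode "%" HEX HEX escapes except for reserved and unsafe characters."""
    def replace(match):
        character = chr(int(match.group(1), 16))
        if character in RESERVED or character in UNSAFE:
            return "%" + match.group(1).upper()  # must stay encoded
        return character
    return _ESCAPE.sub(replace, text)


def normalize_http_url(url: str) -> Tuple[str, str, int, str, str]:
    """Return a tuple that is equal for URLs the rules treat as matching."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()              # scheme: case-insensitive
    host, _, port_text = parts.netloc.partition(":")
    port = int(port_text) if port_text else 80  # empty or missing port
    path = _decode_equivalent(parts.path) or "/"  # empty abs_path is "/"
    return scheme, host.lower(), port, path, _decode_equivalent(parts.query)
```

Two URLs match under this sketch when normalize_http_url returns equal
tuples; the path and query remain case-sensitive.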

For example, the following three URIs are equivalent: 



http://abc.com:80/~smith/home.html
http://ABC.com/%7Esmith/home.html
http://ABC.com:/%7esmith/home.html

3.3 Date/Time Formats

3.3.1 Full Date

HTTP applications have historically allowed three different formats 
for the representation of date/time stamps: 

Sun, 06 Nov 1994 08:49:37 GMT  ; RFC 822, updated by RFC 1123
Sunday, 06-Nov-94 08:49:37 GMT ; RFC 850, obsoleted by RFC 1036
Sun Nov  6 08:49:37 1994       ; ANSI C's asctime() format

The first format is preferred as an Internet standard and represents 
a fixed-length subset of that defined by RFC 1123 (an update to RFC 
822). The second format is in common use, but is based on the 
obsolete RFC 850 [12] date format and lacks a four-digit year. 
HTTP/1.1 clients and servers that parse the date value MUST accept 
all three formats (for compatibility with HTTP/1.0), though they MUST 
only generate the RFC 1123 format for representing HTTP-date values 
in header fields. 

Note: Recipients of date values are encouraged to be robust in 
accepting date values that may have been sent by non-HTTP 
applications, as is sometimes the case when retrieving or posting 
messages via proxies/gateways to SMTP or NNTP. 

All HTTP date/time stamps MUST be represented in Greenwich Mean Time 
(GMT), without exception. This is indicated in the first two formats 
by the inclusion of "GMT" as the three-letter abbreviation for time 
zone, and MUST be assumed when reading the asctime format. 

HTTP-date    = rfc1123-date | rfc850-date | asctime-date

rfc1123-date = wkday "," SP date1 SP time SP "GMT"
rfc850-date  = weekday "," SP date2 SP time SP "GMT"
asctime-date = wkday SP date3 SP time SP 4DIGIT

date1        = 2DIGIT SP month SP 4DIGIT
               ; day month year (e.g., 02 Jun 1982)
date2        = 2DIGIT "-" month "-" 2DIGIT
               ; day-month-year (e.g., 02-Jun-82)
date3        = month SP ( 2DIGIT | ( SP 1DIGIT ))
               ; month day (e.g., Jun  2)

time         = 2DIGIT ":" 2DIGIT ":" 2DIGIT
               ; 00:00:00 - 23:59:59

wkday        = "Mon" | "Tue" | "Wed"
             | "Thu" | "Fri" | "Sat" | "Sun"

weekday      = "Monday" | "Tuesday" | "Wednesday"
             | "Thursday" | "Friday" | "Saturday" | "Sunday"

month        = "Jan" | "Feb" | "Mar" | "Apr"
             | "May" | "Jun" | "Jul" | "Aug"
             | "Sep" | "Oct" | "Nov" | "Dec"

Note: HTTP requirements for the date/time stamp format apply only 
to their usage within the protocol stream. Clients and servers are 
not required to use these formats for user presentation, request 
logging, etc. 
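
The requirement to accept all three formats but generate only the
RFC 1123 form can be sketched with the standard library. The function
names are assumptions, and the sketch relies on the default "C" locale
for English day and month names.

```python
from datetime import datetime, timezone

# Accepted formats, tried in order of preference.
_FORMATS = (
    "%a, %d %b %Y %H:%M:%S GMT",  # RFC 822, updated by RFC 1123
    "%A, %d-%b-%y %H:%M:%S GMT",  # RFC 850 (two-digit year)
    "%a %b %d %H:%M:%S %Y",       # ANSI C asctime(), GMT assumed
)


def parse_http_date(value: str) -> datetime:
    """Parse any of the three HTTP-date formats into an aware UTC datetime."""
    for date_format in _FORMATS:
        try:
            parsed = datetime.strptime(value.strip(), date_format)
        except ValueError:
            continue  # try the next format
        return parsed.replace(tzinfo=timezone.utc)
    raise ValueError("unrecognized HTTP-date: %r" % value)


def format_http_date(moment: datetime) -> str:
    """Generate the preferred fixed-length RFC 1123 form."""
    in_gmt = moment.astimezone(timezone.utc)
    return in_gmt.strftime("%a, %d %b %Y %H:%M:%S GMT")
```

The two-digit year of the RFC 850 form is ambiguous; this sketch simply
accepts the pivot chosen by the library's "%y" directive.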

3.3.2 Delta Seconds 

Some HTTP header fields allow a time value to be specified as an 
integer number of seconds, represented in decimal, after the time 
that the message was received. 

delta-seconds = 1*DIGIT 

3.4 Character Sets 

HTTP uses the same definition of the term "character set" as that 
described for MIME: 

The term "character set" is used in this document to refer to a 
method used with one or more tables to convert a sequence of octets 
into a sequence of characters. Note that unconditional conversion 
in the other direction is not required, in that not all characters 
may be available in a given character set and a character set may 
provide more than one sequence of octets to represent a particular 
character. This definition is intended to allow various kinds of 
character encodings, from simple single-table mappings such as US- 
ASCII to complex table switching methods such as those that use ISO 
2022's techniques. However, the definition associated with a MIME
character set name MUST fully specify the mapping to be performed 
from octets to characters. In particular, use of external profiling 
information to determine the exact mapping is not permitted. 

Note: This use of the term "character set" is more commonly
referred to as a "character encoding." However, since HTTP and MIME
share the same registry, it is important that the terminology also
be shared.



HTTP character sets are identified by case-insensitive tokens. The
complete set of tokens is defined by the IANA Character Set registry.



charset = token 

Although HTTP allows an arbitrary token to be used as a charset 
value, any token that has a predefined value within the IANA 
Character Set registry MUST represent the character set defined by 
that registry. Applications SHOULD limit their use of character sets 
to those defined by the IANA registry. 

3.5 Content Codings 

Content coding values indicate an encoding transformation that has 
been or can be applied to an entity. Content codings are primarily 
used to allow a document to be compressed or otherwise usefully 
transformed without losing the identity of its underlying media type 
and without loss of information. Frequently, the entity is stored in 
coded form, transmitted directly, and only decoded by the recipient. 

content-coding = token 

All content-coding values are case-insensitive. HTTP/1.1 uses 
content-coding values in the Accept-Encoding (section 14.3) and
Content-Encoding (section 14.12) header fields. Although the value 
describes the content-coding, what is more important is that it 
indicates what decoding mechanism will be required to remove the 
encoding. 

The Internet Assigned Numbers Authority (IANA) acts as a registry for 
content-coding value tokens. Initially, the registry contains the
following tokens: 

gzip

An encoding format produced by the file compression program "gzip"
(GNU zip) as described in RFC 1952 [25]. This format is a
Lempel-Ziv coding (LZ77) with a 32 bit CRC.

compress

The encoding format produced by the common UNIX file compression
program "compress". This format is an adaptive Lempel-Ziv-Welch
coding (LZW).

Note: Use of program names for the identification of encoding
formats is not desirable and should be discouraged for future 
encodings. Their use here is representative of historical practice, 
not good design. For compatibility with previous implementations of 
HTTP, applications should consider "x-gzip" and "x-compress" to be 
equivalent to "gzip" and "compress" respectively. 

deflate

The "zlib" format defined in RFC 1950 [31] in combination with
the "deflate" compression mechanism described in RFC 1951 [29].

New content-coding value tokens should be registered; to allow
interoperability between clients and servers, specifications of the 
content coding algorithms needed to implement a new value should be 
publicly available and adequate for independent implementation, and 
conform to the purpose of content coding defined in this section. 
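
A recipient selects a decoder from the content-coding token. The sketch
below handles "gzip" and "deflate" with the Python standard library and
applies the "x-gzip" alias described above; the name decode_content is
an assumption. The LZW "compress" coding is deliberately left
unsupported here because this sketch uses no decoder for it.

```python
import gzip
import zlib

# Historical aliases that should be treated as equivalent.
_ALIASES = {"x-gzip": "gzip", "x-compress": "compress"}


def decode_content(body: bytes, content_coding: str) -> bytes:
    """Remove a single content-coding from an entity-body."""
    coding = content_coding.strip().lower()  # values are case-insensitive
    coding = _ALIASES.get(coding, coding)
    if coding == "gzip":
        return gzip.decompress(body)
    if coding == "deflate":
        # "deflate" here means the zlib container around deflate data.
        return zlib.decompress(body)
    raise ValueError("unsupported content-coding: %r" % content_coding)
```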

3.6 Transfer Codings 

Transfer coding values are used to indicate an encoding 
transformation that has been, can be, or may need to be applied to an 
entity-body in order to ensure "safe transport" through the network. 
This differs from a content coding in that the transfer coding is a 
property of the message, not of the original entity. 

transfer-coding = "chunked" | transfer-extension

transfer-extension = token 

All transfer-coding values are case-insensitive. HTTP/1.1 uses
transfer coding values in the Transfer-Encoding header field (section 
14.40). 

Transfer codings are analogous to the Content-Transfer-Encoding
values of MIME, which were designed to enable safe transport of
binary data over a 7-bit transport service. However, safe transport 
has a different focus for an 8bit-clean transfer protocol. In HTTP, 
the only unsafe characteristic of message-bodies is the difficulty in 
determining the exact body length (section 7.2.2), or the desire to 
encrypt data over a shared transport. 

The chunked encoding modifies the body of a message in order to 
transfer it as a series of chunks, each with its own size indicator, 
followed by an optional footer containing entity-header fields. This 
allows dynamically-produced content to be transferred along with the 
information necessary for the recipient to verify that it has 
received the full message.

Chunked-Body   = *chunk
                 "0" CRLF
                 footer
                 CRLF

chunk          = chunk-size [ chunk-ext ] CRLF
                 chunk-data CRLF

hex-no-zero    = <HEX excluding "0">

chunk-size     = hex-no-zero *HEX
chunk-ext      = *( ";" chunk-ext-name [ "=" chunk-ext-value ] )
chunk-ext-name = token
chunk-ext-val  = token | quoted-string
chunk-data     = chunk-size(OCTET)

footer         = *entity-header

The chunked encoding is ended by a zero-sized chunk followed by the 
footer, which is terminated by an empty line. The purpose of the 
footer is to provide an efficient way to supply information about an 
entity that is generated dynamically; applications MUST NOT send 
header fields in the footer which are not explicitly defined as being 
appropriate for the footer, such as Content-MD5 or future extensions 
to HTTP for digital signatures or other facilities. 

An example process for decoding a Chunked-Body is presented in 
appendix 19.4.6. 

All HTTP/1.1 applications MUST be able to receive and decode the 
"chunked" transfer coding, and MUST ignore transfer coding extensions 
they do not understand. A server which receives an entity-body with a 
transfer-coding it does not understand SHOULD return 501 
(Unimplemented), and close the connection. A server MUST NOT send
transfer-codings to an HTTP/1.0 client. 
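
The Chunked-Body grammar above can be exercised with a small encoder and
decoder. This is an illustrative sketch rather than the procedure given
in the appendix: the function names are assumptions, the whole message
is assumed to be in memory, chunk extensions are skipped, and folded
footer lines are not handled.

```python
from typing import Iterable, List, Tuple


def encode_chunked(chunks: Iterable[bytes],
                   footer: Iterable[bytes] = ()) -> bytes:
    """Serialize data chunks and footer lines as a Chunked-Body."""
    out = bytearray()
    for chunk in chunks:
        if chunk:  # a zero-sized chunk would end the body early
            out += b"%X\r\n" % len(chunk) + chunk + b"\r\n"
    out += b"0\r\n"
    for line in footer:
        out += line + b"\r\n"
    out += b"\r\n"  # empty line terminates the footer
    return bytes(out)


def decode_chunked(data: bytes) -> Tuple[bytes, List[bytes]]:
    """Return (entity-body, footer lines) from a complete Chunked-Body."""
    body = bytearray()
    position = 0
    while True:
        line_end = data.index(b"\r\n", position)
        # Ignore any chunk-ext that follows the size.
        size_field = data[position:line_end].split(b";", 1)[0].strip()
        size = int(size_field, 16)
        position = line_end + 2
        if size == 0:
            break
        body += data[position:position + size]
        position += size
        if data[position:position + 2] != b"\r\n":
            raise ValueError("chunk-data not followed by CRLF")
        position += 2
    footer = []
    while True:
        line_end = data.index(b"\r\n", position)
        line = data[position:line_end]
        position = line_end + 2
        if not line:
            break
        footer.append(line)
    return bytes(body), footer
```

Truncated input surfaces as a ValueError from the index or int calls,
which is how this sketch signals that the full message was not received.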

3.7 Media Types 

HTTP uses Internet Media Types in the Content-Type (section 14.18) 
and Accept (section 14.1) header fields in order to provide open and 
extensible data typing and type negotiation. 



media-type = type "/" subtype *( ";" parameter )
type       = token
subtype    = token

Parameters may follow the type/subtype in the form of attribute/value
pairs.

parameter = attribute "=" value
attribute = token
value     = token | quoted-string



The type, subtype, and parameter attribute names are case- 
insensitive. Parameter values may or may not be case-sensitive, 
depending on the semantics of the parameter name. Linear white space 
(LWS) MUST NOT be used between the type and subtype, nor between an 
attribute and its value. User agents that recognize the media-type 
MUST process (or arrange to be processed by any external applications 
used to process that type/subtype by the user agent) the parameters 
for that MIME type as described by that type/subtype definition and
inform the user of any problems discovered.

Note: some older HTTP applications do not recognize media type 
parameters. When sending data to older HTTP applications, 
implementations should only use media type parameters when they are 
required by that type/subtype definition. 

Media-type values are registered with the Internet Assigned Number 
Authority (IANA). The media type registration process is outlined in 
RFC 2048 [17]. Use of non-registered media types is discouraged. 
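
A media-type value splits into a type, a subtype, and a set of
attribute/value parameters. The sketch below lower-cases the
case-insensitive parts and leaves parameter values untouched; the name
parse_media_type is an assumption, and semicolons inside quoted-string
values are not handled.

```python
from typing import Dict, Tuple


def parse_media_type(value: str) -> Tuple[str, str, Dict[str, str]]:
    """Split 'type/subtype; attribute=value' into its components."""
    parts = value.split(";")
    type_, separator, subtype = parts[0].strip().partition("/")
    if not separator or not type_ or not subtype:
        raise ValueError("missing type or subtype: %r" % value)
    parameters = {}
    for raw in parts[1:]:
        attribute, separator, parameter_value = raw.strip().partition("=")
        if not separator or not attribute:
            raise ValueError("malformed parameter: %r" % raw)
        # Strip the surrounding quotes of a quoted-string value.
        if (len(parameter_value) >= 2 and parameter_value[0] == '"'
                and parameter_value[-1] == '"'):
            parameter_value = parameter_value[1:-1]
        parameters[attribute.lower()] = parameter_value
    return type_.lower(), subtype.lower(), parameters
```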

3.7.1 Canonicalization and Text Defaults

Internet media types are registered with a canonical form. In 
general, an entity-body transferred via HTTP messages MUST be 
represented in the appropriate canonical form prior to its 
transmission; the exception is "text" types, as defined in the next 
paragraph. 

When in canonical form, media subtypes of the "text" type use CRLF as 
the text line break. HTTP relaxes this requirement and allows the 
transport of text media with plain CR or LF alone representing a line 
break when it is done consistently for an entire entity-body. HTTP 
applications MUST accept CRLF, bare CR, and bare LF as being 
representative of a line break in text media received via HTTP. In 
addition, if the text is represented in a character set that does not 
use octets 13 and 10 for CR and LF respectively, as is the case for 
some multi-byte character sets, HTTP allows the use of whatever octet 
sequences are defined by that character set to represent the 
equivalent of CR and LF for line breaks. This flexibility regarding 
line breaks applies only to text media in the entity-body; a bare CR 
or LF MUST NOT be substituted for CRLF within any of the HTTP control 
structures (such as header fields and multipart boundaries). 
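
A recipient that wants a single internal line-break convention might
normalize text media as sketched below. The function name is
hypothetical, and the sketch assumes a character set that uses octets 13
and 10 for CR and LF; it must not be applied to header fields or
multipart boundaries.

```python
import re


def normalize_text_line_breaks(body: bytes) -> bytes:
    """Rewrite CRLF, bare CR, and bare LF in a text entity-body as CRLF."""
    # CRLF is listed first so a CRLF pair is not counted as two breaks.
    return re.sub(rb"\r\n|\r|\n", b"\r\n", body)
```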

If an entity-body is encoded with a Content-Encoding, the underlying 
data MUST be in a form defined above prior to being encoded. 

The "charset" parameter is used with some media types to define the
character set (section 3.4) of the data. When no explicit charset 
parameter is provided by the sender, media subtypes of the "text" 
type are defined to have a default charset value of "ISO-8859-1" when 
received via HTTP. Data in character sets other than "ISO-8859-1" or 
its subsets MUST be labeled with an appropriate charset value. 
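
The default described above could be applied as in the following
sketch. The helper name effective_charset is an assumption of this
example; it relies on the standard library's email.message.Message to
read the charset parameter.

```python
from email.message import Message


def effective_charset(content_type):
    """Hypothetical helper: charset to assume for a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()      # lowercased, or None if absent
    if charset is None and msg.get_content_maintype() == "text":
        return "iso-8859-1"                  # default for "text" subtypes
    return charset
```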

Some HTTP/1.0 software has interpreted a Content-Type header without
charset parameter incorrectly to mean "recipient should guess."
Senders wishing to defeat this behavior MAY include a charset
parameter even when the charset is ISO-8859-1 and SHOULD do so when
it is known that it will not confuse the recipient.

Unfortunately, some older HTTP/1.0 clients did not deal properly with
an explicit charset parameter. HTTP/1.1 recipients MUST respect the
charset label provided by the sender; and those user agents that have 
a provision to "guess" a charset MUST use the charset from the 
content-type field if they support that charset, rather than the 
recipient's preference, when initially displaying a document. 

3.7.2 Multipart Types 

MIME provides for a number of "multipart" types -- encapsulations of
one or more entities within a single message-body. All multipart 
types share a common syntax, as defined in MIME [7], and MUST 
include a boundary parameter as part of the media type value. The 
message body is itself a protocol element and MUST therefore use only 
CRLF to represent line breaks between body-parts. Unlike in MIME, the 
epilogue of any multipart message MUST be empty; HTTP applications 
MUST NOT transmit the epilogue (even if the original multipart 
contains an epilogue). 

In HTTP, multipart body-parts MAY contain header fields which are 
significant to the meaning of that part. A Content-Location header
field (section 14.15) SHOULD be included in the body-part of each 
enclosed entity that can be identified by a URL. 

In general, an HTTP user agent SHOULD follow the same or similar 
behavior as a MIME user agent would upon receipt of a multipart type. 
If an application receives an unrecognized multipart subtype, the 
application MUST treat it as being equivalent to "multipart/mixed". 

Note: The "multipart/form-data" type has been specifically defined
for carrying form data suitable for processing via the POST request
method, as described in RFC 1867 [15].

3.8 Product Tokens

Product tokens are used to allow communicating applications to 
identify themselves by software name and version. Most fields using 
product tokens also allow sub-products which form a significant part 
of the application to be listed, separated by whitespace. By 
convention, the products are listed in order of their significance 
for identifying the application. 

product = token ["/" product-version] 

product-version = token 

Examples: 

User-Agent: CERN-LineMode/2.15 libwww/2.17b3
Server: Apache/0.8.4

Product tokens should be short and to the point; use of them for
advertising or other non-essential information is explicitly
forbidden. Although any token character may appear in a
product-version, this token SHOULD only be used for a version
identifier (i.e., successive versions of the same product SHOULD only
differ in the product-version portion of the product value).
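
Splitting a field value into product tokens might look like the sketch
below. The function name is hypothetical, and the sketch assumes the
value holds only products separated by whitespace, with no
parenthesized comments.

```python
def parse_products(field_value):
    """Hypothetical helper: list (product, version-or-None) pairs in order."""
    products = []
    for item in field_value.split():
        name, slash, version = item.partition("/")
        products.append((name, version if slash else None))
    return products
```

The order of the returned list follows the order in the field, which by
convention reflects each product's significance.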

3.9 Quality Values 

HTTP content negotiation (section 12) uses short "floating point" 
numbers to indicate the relative importance ("weight") of various 
negotiable parameters. A weight is normalized to a real number in the 
range 0 through 1, where 0 is the minimum and 1 the maximum value. 
HTTP/1.1 applications MUST NOT generate more than three digits after 
the decimal point. User configuration of these values SHOULD also be 
limited in this fashion. 



qvalue         = ( "0" [ "." 0*3DIGIT ] )
               | ( "1" [ "." 0*3("0") ] )



"Quality values" is a misnomer, since these values merely represent 
relative degradation in desired quality. 
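
The qvalue rule can be checked mechanically. The following is a minimal
sketch; the name parse_qvalue and the regular expression are
assumptions of this example rather than anything the specification
prescribes.

```python
import re

# "0" with up to three digits, or "1" with up to three zeros.
QVALUE = re.compile(r"0(?:\.[0-9]{0,3})?|1(?:\.0{0,3})?")


def parse_qvalue(text):
    """Hypothetical helper: return the weight as a float, or raise ValueError."""
    if QVALUE.fullmatch(text) is None:
        raise ValueError("not a qvalue: %r" % (text,))
    return float(text)
```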

3.10 Language Tags

A language tag identifies a natural language spoken, written, or 
otherwise conveyed by human beings for communication of information 
to other human beings. Computer languages are explicitly excluded. 
HTTP uses language tags within the Accept-Language and
Content-Language fields.

The syntax and registry of HTTP language tags is the same as that
defined by RFC 1766 [1]. In summary, a language tag is composed of 1
or more parts: A primary language tag and a possibly empty series of
subtags:

language-tag = primary-tag *( "-" subtag ) 

primary-tag = 1*8ALPHA
subtag = 1*8ALPHA 

Whitespace is not allowed within the tag and all tags are case- 
insensitive. The name space of language tags is administered by the 
IANA. Example tags include:

en, en-US, en-cockney, i-cherokee, x-pig-latin 

where any two-letter primary-tag is an ISO 639 language abbreviation 
and any two-letter initial subtag is an ISO 3166 country code. (The 
last three tags above are not registered tags; all but the last are 
examples of tags which could be registered in future.) 
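
A syntactic check of the language-tag grammar could be written as
follows. This sketch only validates the shape of a tag; it does not
consult the IANA registry, and the function name is hypothetical.

```python
import re

# language-tag = primary-tag *( "-" subtag ), each part 1*8ALPHA
LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*")


def parse_language_tag(tag):
    """Hypothetical helper: return (primary_tag, [subtags]) in lower case."""
    if LANGUAGE_TAG.fullmatch(tag) is None:
        raise ValueError("not a language tag: %r" % (tag,))
    parts = tag.lower().split("-")   # tags are case-insensitive
    return parts[0], parts[1:]
```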

3.11 Entity Tags

Entity tags are used for comparing two or more entities from the same 
requested resource. HTTP/1.1 uses entity tags in the ETag (section 
14.20), If-Match (section 14.25), If-None-Match (section 14.26), and 
If-Range (section 14.27) header fields. The definition of how they 
are used and compared as cache validators is in section 13.3.3. An 
entity tag consists of an opaque quoted string, possibly prefixed by 
a weakness indicator. 

entity-tag = [ weak ] opaque-tag 

weak = "W/" 

opaque-tag = quoted-string 

A "strong entity tag" may be shared by two entities of a resource 
only if they are equivalent by octet equality. 

A "weak entity tag," indicated by the "W/" prefix, may be shared by 
two entities of a resource only if the entities are equivalent and 
could be substituted for each other with no significant change in 
semantics. A weak entity tag can only be used for weak comparison. 

An entity tag MUST be unique across all versions of all entities 
associated with a particular resource. A given entity tag value may 
be used for entities obtained by requests on different URIs without 
implying anything about the equivalence of those entities. 
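
The entity-tag syntax can be illustrated with a short parser. The names
below are hypothetical, and the strong_match function is only one
reading of the rule that a weak tag cannot take part in strong
comparison; the authoritative comparison rules are those of section
13.3.3.

```python
import re

# entity-tag = [ "W/" ] quoted-string
ENTITY_TAG = re.compile(r'(W/)?"((?:[^"\\]|\\.)*)"')


def parse_entity_tag(text):
    """Hypothetical helper: return (is_weak, opaque_tag_contents)."""
    m = ENTITY_TAG.fullmatch(text)
    if m is None:
        raise ValueError("not an entity tag: %r" % (text,))
    return m.group(1) is not None, m.group(2)


def strong_match(a, b):
    """True only if neither tag is weak and the opaque tags are identical."""
    weak_a, tag_a = parse_entity_tag(a)
    weak_b, tag_b = parse_entity_tag(b)
    return not weak_a and not weak_b and tag_a == tag_b
```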

3.12 Range Units

HTTP/1.1 allows a client to request that only part (a range of) the 
response entity be included within the response. HTTP/1.1 uses range 
units in the Range (section 14.36) and Content-Range (section 14.17) 
header fields. An entity may be broken down into subranges according 
to various structural units. 

range-unit = bytes-unit | other-range-unit

bytes-unit = "bytes" 

other-range-unit = token 

The only range unit defined by HTTP/1.1 is "bytes". HTTP/1.1
implementations may ignore ranges specified using other units.
HTTP/1.1 has been designed to allow implementations of applications 
that do not depend on knowledge of ranges. 

4 HTTP Message 

4.1 Message Types

HTTP messages consist of requests from client to server and responses 
from server to client. 

HTTP-message = Request | Response ; HTTP/1.1 messages

Request (section 5) and Response (section 6) messages use the generic 
message format of RFC 822 [9] for transferring entities (the payload 
of the message). Both types of message consist of a start-line, one 
or more header fields (also known as "headers"), an empty line (i.e., 
a line with nothing preceding the CRLF) indicating the end of the 
header fields, and an optional message-body. 

generic-message = start-line
                  *message-header
                  CRLF
                  [ message-body ]

start-line      = Request-Line | Status-Line

In the interest of robustness, servers SHOULD ignore any empty 
line(s) received where a Request-Line is expected. In other words, if 
the server is reading the protocol stream at the beginning of a 
message and receives a CRLF first, it should ignore the CRLF. 

Note: certain buggy HTTP/1.0 client implementations generate
extra CRLFs after a POST request. To restate what is explicitly
forbidden by the BNF, an HTTP/1.1 client must not preface or follow
a request with an extra CRLF.
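
The generic message layout, together with the robustness rule about
leading empty lines, might be handled as in this sketch. The function
name split_message is an assumption, and the sketch expects the whole
message to be available in memory as bytes.

```python
def split_message(raw: bytes):
    """Hypothetical helper: return (start_line, header_lines, message_body)."""
    # Robustness: ignore empty lines received where a start-line is expected.
    while raw.startswith(b"\r\n"):
        raw = raw[2:]
    # The first empty line (CRLF CRLF) ends the header fields.
    head, separator, body = raw.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("no empty line terminating the header fields")
    lines = head.split(b"\r\n")
    return lines[0], lines[1:], body
```

Header lines are returned still folded; unfolding is a separate step.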

4.2 Message Headers 

HTTP header fields, which include general-header (section 4.5),
request-header (section 5.3), response-header (section 6.2), and
entity-header (section 7.1) fields, follow the same generic format as
that given in Section 3.1 of RFC 822 [9]. Each header field consists
of a name followed by a colon (":") and the field value. Field names
are case-insensitive. The field value may be preceded by any amount
of LWS, though a single SP is preferred. Header fields can be 
extended over multiple lines by preceding each extra line with at 
least one SP or HT. Applications SHOULD follow "common form" when 
generating HTTP constructs, since there might exist some 
implementations that fail to accept anything beyond the common forms. 

message-header = field-name ":" [ field-value ] CRLF 

field-name = token 

field-value = *( field-content | LWS )

field-content = <the OCTETs making up the field-value
                and consisting of either *TEXT or combinations
                of token, tspecials, and quoted-string>

The order in which header fields with differing field names are 
received is not significant. However, it is "good practice" to send 
general-header fields first, followed by request-header or
response-header fields, and ending with the entity-header fields.

Multiple message-header fields with the same field-name may be 
present in a message if and only if the entire field-value for that 
header field is defined as a comma-separated list [i.e., #(values)].
It MUST be possible to combine the multiple header fields into one 
"field-name: field-value" pair, without changing the semantics of the 
message, by appending each subsequent field-value to the first, each 
separated by a comma. The order in which header fields with the same 
field-name are received is therefore significant to the 
interpretation of the combined field value, and thus a proxy MUST NOT 
change the order of these field values when a message is forwarded. 
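
Unfolding continuation lines and combining repeated fields could be
sketched as follows. The function name parse_headers and the choice of
a dict keyed by lowercased field-name are assumptions of this example;
it takes header lines as text with the CRLF already removed.

```python
def parse_headers(lines):
    """Hypothetical helper: map lowercased field-name to combined field-value."""
    unfolded = []
    for line in lines:
        if line[:1] in (" ", "\t") and unfolded:
            # A line starting with SP or HT extends the previous field.
            unfolded[-1] += " " + line.strip()
        else:
            unfolded.append(line)
    fields = {}
    for line in unfolded:
        name, _, value = line.partition(":")
        key = name.lower()                 # field names are case-insensitive
        value = value.strip()              # leading LWS is not significant
        if key in fields:
            # Append in arrival order, separated by a comma.
            fields[key] += ", " + value
        else:
            fields[key] = value
    return fields
```

Because values are appended in the order received, the combined value
preserves the ordering that proxies are required to keep.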

4.3 Message Body

The message-body (if any) of an HTTP message is used to carry the 
entity-body associated with the request or response. The message-body 
differs from the entity-body only when a transfer coding has been 
applied, as indicated by the Transfer-Encoding header field (section 
14.40). 

message-body = entity-body
             | <entity-body encoded as per Transfer-Encoding>

Transfer-Encoding MUST be used to indicate any transfer codings 
applied by an application to ensure safe and proper transfer of the 
message. Transfer-Encoding is a property of the message, not of the 
entity, and thus can be added or removed by any application along the 
request/response chain. 

The rules for when a message-body is allowed in a message differ for 
requests and responses. 

The presence of a message-body in a request is signaled by the 
inclusion of a Content-Length or Transfer-Encoding header field in 
the request's message-headers. A message-body MAY be included in a 
request only when the request method (section 5.1.1) allows an 
entity-body. 

For response messages, whether or not a message-body is included with 
a message is dependent on both the request method and the response 
status code (section 6.1.1). All responses to the HEAD request method 
MUST NOT include a message-body, even though the presence of
entity-header fields might lead one to believe they do. All 1xx
(informational), 204 (no content), and 304 (not modified) responses 
MUST NOT include a message-body. All other responses do include a 
message-body, although it may be of zero length. 
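
The response rule in the preceding paragraph reduces to a small
predicate. The function name is hypothetical; the logic follows the
paragraph directly.

```python
def response_has_body(request_method, status_code):
    """Hypothetical helper: True if the response includes a message-body."""
    if request_method == "HEAD":
        return False                       # never, whatever the headers say
    if 100 <= status_code < 200:
        return False                       # 1xx (informational)
    if status_code in (204, 304):
        return False                       # no content, not modified
    return True                            # body present, possibly zero length
```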

4.4 Message Length 

When a message-body is included with a message, the length of that 
body is determined by one of the following (in order of precedence): 

1. Any response message which MUST NOT include a message-body (such as
the 1xx, 204, and 304 responses and any response to a HEAD request) is
always terminated by the first empty line after the header fields,
regardless of the entity-header fields present in the message.

2. If a Transfer-Encoding header field (section 14.40) is present and
indicates that the "chunked" transfer coding has been applied, then
the length is defined by the chunked encoding (section 3.6).

3. If a Content-Length header field (section 14.14) is present, its
value in bytes represents the length of the message-body.

4. If the message uses the media type "multipart/byteranges", which is
self-delimiting, then that defines the length. This media type MUST
NOT be used unless the sender knows that the recipient can parse it;
the presence in a request of a Range header with multiple byte-range
specifiers implies that the client can parse multipart/byteranges
responses.

5. By the server closing the connection. (Closing the connection 
cannot be used to indicate the end of a request body, since that 
would leave no possibility for the server to send back a response.) 

For compatibility with HTTP/1.0 applications, HTTP/1.1 requests 
containing a message-body MUST include a valid Content-Length header 
field unless the server is known to be HTTP/1.1 compliant. If a 
request contains a message-body and a Content-Length is not given, 
the server SHOULD respond with 400 (bad request) if it cannot 
determine the length of the message, or with 411 (length required) if 
it wishes to insist on receiving a valid Content-Length. 

All HTTP/1.1 applications that receive entities MUST accept the 
"chunked" transfer coding (section 3.6), thus allowing this mechanism 
to be used for messages when the message length cannot be determined 
in advance. 

Messages MUST NOT include both a Content-Length header field and the
"chunked" transfer coding. If both are received, the Content-Length
MUST be ignored.

When a Content-Length is given in a message where a message-body is
allowed, its field value MUST exactly match the number of OCTETs in
the message-body. HTTP/1.1 user agents MUST notify the user when an
invalid length is received and detected.
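
The order of precedence in this section can be sketched as a single
decision function. The name body_framing, the returned labels, and the
headers dict keyed by lowercased field-name are all assumptions of this
example.

```python
def body_framing(headers, must_not_have_body=False):
    """Hypothetical helper: return (kind, length) for how the body is delimited."""
    # 1. Responses that must not carry a body end at the first empty line.
    if must_not_have_body:
        return ("none", None)
    # 2. "chunked" takes precedence; any Content-Length is then ignored.
    codings = headers.get("transfer-encoding", "").split(",")
    if "chunked" in [coding.strip().lower() for coding in codings]:
        return ("chunked", None)
    # 3. Content-Length gives the body length in bytes.
    if "content-length" in headers:
        return ("length", int(headers["content-length"]))
    # 4. multipart/byteranges is self-delimiting.
    if headers.get("content-type", "").lower().startswith("multipart/byteranges"):
        return ("multipart", None)
    # 5. Otherwise the server closing the connection ends the body.
    return ("close", None)
```

Case 5 applies to responses only, since closing the connection cannot
mark the end of a request body.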

4.5 General Header Fields

There are a few header fields which have general applicability for 
both request and response messages, but which do not apply to the 
entity being transferred. These header fields apply only to the 
message being transmitted. 

general-header = Cache-Control            ; Section 14.9
               | Connection               ; Section 14.10
               | Date                     ; Section 14.19
               | Pragma                   ; Section 14.32
               | Transfer-Encoding        ; Section 14.40
               | Upgrade                  ; Section 14.41
               | Via                      ; Section 14.44

General-header field names can be extended reliably only in
combination with a change in the protocol version. However, new or
experimental header fields may be given the semantics of general
header fields if all parties in the communication recognize them to
be general-header fields. Unrecognized header fields are treated as
entity-header fields. 

5 Request 

A request message from a client to a server includes, within the 
first line of that message, the method to be applied to the resource, 
the identifier of the resource, and the protocol version in use. 

Request = Request-Line              ; Section 5.1
          *( general-header         ; Section 4.5
           | request-header         ; Section 5.3
           | entity-header )        ; Section 7.1
          CRLF
          [ message-body ]          ; Section 7.2

5.1 Request-Line

The Request-Line begins with a method token, followed by the
Request-URI and the protocol version, and ending with CRLF. The 
elements are separated by SP characters. No CR or LF are allowed 
except in the final CRLF sequence. 

Request-Line = Method SP Request-URI SP HTTP-Version CRLF 
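
A strict reading of the Request-Line rule could be implemented as
below. The function name is hypothetical. The sketch accepts exactly
one SP between elements and a single terminating CRLF, and performs no
validation of the individual elements.

```python
def parse_request_line(line):
    """Hypothetical helper: return (method, request_uri, http_version)."""
    if not line.endswith("\r\n"):
        raise ValueError("Request-Line must end with CRLF")
    core = line[:-2]
    if "\r" in core or "\n" in core:
        raise ValueError("CR or LF allowed only in the final CRLF")
    parts = core.split(" ")
    if len(parts) != 3 or "" in parts:
        raise ValueError("expected Method SP Request-URI SP HTTP-Version")
    method, request_uri, http_version = parts
    return method, request_uri, http_version
```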

5.1.1 Method

The Method token indicates the method to be performed on the resource
identified by the Request-URI. The method is case-sensitive.

Method         = "OPTIONS"                ; Section 9.2
               | "GET"                    ; Section 9.3
               | "HEAD"                   ; Section 9.4
               | "POST"                   ; Section 9.5
               | "PUT"                    ; Section 9.6
               | "DELETE"                 ; Section 9.7
               | "TRACE"                  ; Section 9.8
               | extension-method

extension-method = token

The list of methods allowed by a resource can be specified in an
Allow header field (section 14.7). The return code of the response
always notifies the client whether a method is currently allowed on a
resource, since the set of allowed methods can change dynamically.
Servers SHOULD return the status code 405 (Method Not Allowed) if the
method is known by the server but not allowed for the requested
resource, and 501 (Not Implemented) if the method is unrecognized or
not implemented by the server. The list of methods known by a server
can be listed in a Public response-header field (section 14.35).

The methods GET and HEAD MUST be supported by all general-purpose
servers. All other methods are optional; however, if the above
methods are implemented, they MUST be implemented with the same
semantics as those specified in section 9.

5.1.2 Request-URI

The Request-URI is a Uniform Resource Identifier (section 3.2) and
identifies the resource upon which to apply the request.

Request-URI = "*" | absoluteURI | abs_path

The three options for Request-URI are dependent on the nature of the
request. The asterisk "*" means that the request does not apply to a
particular resource, but to the server itself, and is only allowed
when the method used does not necessarily apply to a resource. One
example would be

OPTIONS * HTTP/1.1

The absoluteURI form is required when the request is being made to a
proxy. The proxy is requested to forward the request or service it
from a valid cache, and return the response. Note that the proxy MAY
forward the request on to another proxy or directly to the server 
specified by the absoluteURI. In order to avoid request loops, a 
proxy MUST be able to recognize all of its server names, including 
any aliases, local variations, and the numeric IP address. An example 
Request-Line would be: 

GET http://www.w3.org/pub/WWW/TheProject.html HTTP/1.1

To allow for transition to absoluteURIs in all requests in future 
versions of HTTP, all HTTP/1.1 servers MUST accept the absoluteURI
form in requests, even though HTTP/1.1 clients will only generate
them in requests to proxies. 

The most common form of Request-URI is that used to identify a 
resource on an origin server or gateway. In this case the absolute 
path of the URI MUST be transmitted (see section 3.2.1, abs_path) as 
the Request-URI, and the network location of the URI (net_loc) MUST 
be transmitted in a Host header field. For example, a client wishing 
to retrieve the resource above directly from the origin server would 
create a TCP connection to port 80 of the host "www.w3.org" and send 
the lines: 

GET /pub/WWW/TheProject.html HTTP/1.1
Host: www.w3.org 

followed by the remainder of the Request. Note that the absolute path 
cannot be empty; if none is present in the original URI, it MUST be 
given as "/" (the server root). 

If a proxy receives a request without any path in the Request-URI and 
the method specified is capable of supporting the asterisk form of 
request, then the last proxy on the request chain MUST forward the 
request with "*" as the final Request-URI. For example, the request 

OPTIONS http://www.ics.uci.edu:8001 HTTP/1.1

would be forwarded by the proxy as

OPTIONS * HTTP/1.1
Host: www.ics.uci.edu:8001

after connecting to port 8001 of host "www.ics.uci.edu". 

The Request-URI is transmitted in the format specified in section 
3.2.1. The origin server MUST decode the Request-URI in order to 
properly interpret the request. Servers SHOULD respond to invalid 
Request-URIs with an appropriate status code. 

In requests that they forward, proxies MUST NOT rewrite the
"abs_path" part of a Request-URI in any way except as noted above to 
replace a null abs_path with "*", no matter what the proxy does in 
its internal implementation. 

Note: The "no rewrite" rule prevents the proxy from changing the 
meaning of the request when the origin server is improperly using a 
non-reserved URL character for a reserved purpose. Implementers 
should be aware that some pre-HTTP/1.1 proxies have been known to
rewrite the Request-URI. 

5.2 The Resource Identified by a Request 

HTTP/1.1 origin servers SHOULD be aware that the exact resource 
identified by an Internet request is determined by examining both the 
Request-URI and the Host header field. 

An origin server that does not allow resources to differ by the 
requested host MAY ignore the Host header field value. (But see 
section 19.5.1 for other requirements on Host support in HTTP/1.1.)

An origin server that does differentiate resources based on the host 
requested (sometimes referred to as virtual hosts or vanity 
hostnames) MUST use the following rules for determining the requested 
resource on an HTTP/1.1 request: 

1. If Request-URI is an absoluteURI, the host is part of the 
Request-URI. Any Host header field value in the request MUST be 
ignored. 

2. If the Request-URI is not an absoluteURI, and the request 
includes a Host header field, the host is determined by the Host 
header field value. 

3. If the host as determined by rule 1 or 2 is not a valid host on 
the server, the response MUST be a 400 (Bad Request) error 
message. 

Recipients of an HTTP/1.0 request that lacks a Host header field MAY 
attempt to use heuristics (e.g., examination of the URI path for 
something unique to a particular host) in order to determine what 
exact resource is being requested. 
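
The three rules above might be applied as in this sketch. The function
name, the headers dict keyed by lowercased field-name, and the
valid_hosts set are assumptions of this example. The sketch raises
ValueError where a server would answer 400 (Bad Request) and returns
None when neither rule 1 nor rule 2 yields a host, leaving any
heuristics to the caller.

```python
from urllib.parse import urlsplit


def requested_host(request_uri, headers, valid_hosts):
    """Hypothetical helper: determine the host a request is addressed to."""
    parts = urlsplit(request_uri)
    if parts.scheme and parts.netloc:
        host = parts.netloc            # rule 1: absoluteURI; Host is ignored
    elif "host" in headers:
        host = headers["host"]         # rule 2: take the Host header field
    else:
        return None                    # no host information in the request
    host = host.lower()
    if host not in valid_hosts:
        # rule 3: not a valid host on this server
        raise ValueError("400 Bad Request: unknown host %r" % (host,))
    return host
```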

5.3 Request Header Fields 

The request-header fields allow the client to pass additional
information about the request, and about the client itself, to the 
server. These fields act as request modifiers, with semantics 
equivalent to the parameters on a programming language method
invocation.

request-header = Accept                   ; Section 14.1
               | Accept-Charset           ; Section 14.2
               | Accept-Encoding          ; Section 14.3
               | Accept-Language          ; Section 14.4
               | Authorization            ; Section 14.8
               | From                     ; Section 14.22
               | Host                     ; Section 14.23
               | If-Modified-Since        ; Section 14.24
               | If-Match                 ; Section 14.25
               | If-None-Match            ; Section 14.26
               | If-Range                 ; Section 14.27
               | If-Unmodified-Since      ; Section 14.28
               | Max-Forwards             ; Section 14.31
               | Proxy-Authorization      ; Section 14.34
               | Range                    ; Section 14.36
               | Referer                  ; Section 14.37
               | User-Agent               ; Section 14.42

Request-header field names can be extended reliably only in
combination with a change in the protocol version. However, new or
experimental header fields MAY be given the semantics of
request-header fields if all parties in the communication recognize
them to be request-header fields. Unrecognized header fields are
treated as entity-header fields.

6 Response 

After receiving and interpreting a request message, a server responds 
with an HTTP response message. 



Response = Status-Line               ; Section 6.1
           *( general-header         ; Section 4.5
            | response-header        ; Section 6.2
            | entity-header )        ; Section 7.1
           CRLF
           [ message-body ]          ; Section 7.2



6.1 Status-Line



The first line of a Response message is the Status-Line, consisting 
of the protocol version followed by a numeric status code and its 
associated textual phrase, with each element separated by SP 
characters. No CR or LF is allowed except in the final CRLF 
sequence. 

Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF

6.1.1 Status Code and Reason Phrase

The Status-Code element is a 3-digit integer result code of the 
attempt to understand and satisfy the request. These codes are fully 
defined in section 10. The Reason-Phrase is intended to give a short 
textual description of the Status-Code. The Status-Code is intended 
for use by automata and the Reason-Phrase is intended for the human 
user. The client is not required to examine or display the Reason- 
Phrase. 

The first digit of the Status-Code defines the class of response. The 
last two digits do not have any categorization role. There are 5 
values for the first digit: 

o 1xx: Informational - Request received, continuing process

o 2xx: Success - The action was successfully received, understood, 
and accepted 

o 3xx: Redirection - Further action must be taken in order to 
complete the request 

o 4xx: Client Error - The request contains bad syntax or cannot be 
fulfilled 



o 5xx: Server Error - The server failed to fulfill an apparently
valid request



The individual values of the numeric status codes defined for 
HTTP/1.1, and an example set of corresponding Reason-Phrase's, are 
presented below. The reason phrases listed here are only recommended 
-- they may be replaced by local equivalents without affecting the
protocol. 

Status-Code    = "100"   ; Continue
               | "101"   ; Switching Protocols
               | "200"   ; OK
               | "201"   ; Created
               | "202"   ; Accepted
               | "203"   ; Non-Authoritative Information
               | "204"   ; No Content
               | "205"   ; Reset Content
               | "206"   ; Partial Content
               | "300"   ; Multiple Choices
               | "301"   ; Moved Permanently
               | "302"   ; Moved Temporarily
               | "303"   ; See Other
               | "304"   ; Not Modified
               | "305"   ; Use Proxy
               | "400"   ; Bad Request
               | "401"   ; Unauthorized
               | "402"   ; Payment Required
               | "403"   ; Forbidden
               | "404"   ; Not Found
               | "405"   ; Method Not Allowed
               | "406"   ; Not Acceptable
               | "407"   ; Proxy Authentication Required
               | "408"   ; Request Time-out
               | "409"   ; Conflict
               | "410"   ; Gone
               | "411"   ; Length Required
               | "412"   ; Precondition Failed
               | "413"   ; Request Entity Too Large
               | "414"   ; Request-URI Too Large
               | "415"   ; Unsupported Media Type
               | "500"   ; Internal Server Error
               | "501"   ; Not Implemented
               | "502"   ; Bad Gateway
               | "503"   ; Service Unavailable
               | "504"   ; Gateway Time-out
               | "505"   ; HTTP Version not supported
               | extension-code



extension-code = 3DIGIT 

Reason-Phrase = *<TEXT, excluding CR, LF> 

HTTP status codes are extensible. HTTP applications are not required 
to understand the meaning of all registered status codes, though such 
understanding is obviously desirable. However, applications MUST 
understand the class of any status code, as indicated by the first 
digit, and treat any unrecognized response as being equivalent to the 
x00 status code of that class, with the exception that an
unrecognized response MUST NOT be cached. For example, if an 
unrecognized status code of 431 is received by the client, it can 
safely assume that there was something wrong with its request and 
treat the response as if it had received a 400 status code. In such 
cases, user agents SHOULD present to the user the entity returned 
with the response, since that entity is likely to include human- 
readable information which will explain the unusual status. 
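
The Status-Line layout and the fallback for unrecognized codes can be
sketched together. Both function names and the known_codes argument
are assumptions of this example; which codes an application recognizes
is its own decision.

```python
def parse_status_line(line):
    """Hypothetical helper: return (http_version, status_code, reason_phrase)."""
    version, code, reason = line.rstrip("\r\n").split(" ", 2)
    if len(code) != 3 or not (code.isascii() and code.isdigit()):
        raise ValueError("Status-Code must be a 3-digit integer")
    return version, int(code), reason


def effective_status(code, known_codes):
    """Treat an unrecognized code as the x00 code of its class."""
    if code in known_codes:
        return code
    return (code // 100) * 100
```

A caller that falls back to the x00 code still has to remember that the
original response was unrecognized, since such a response must not be
cached.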

6.2 Response Header Fields

The response-header fields allow the server to pass additional
information about the response which cannot be placed in the
Status-Line. These header fields give information about the server and
about further access to the resource identified by the Request-URI.

response-header = Age                     ; Section 14.6
                | Location                ; Section 14.30
                | Proxy-Authenticate      ; Section 14.33
                | Public                  ; Section 14.35
                | Retry-After             ; Section 14.38
                | Server                  ; Section 14.39
                | Vary                    ; Section 14.43
                | Warning                 ; Section 14.45
                | WWW-Authenticate        ; Section 14.46

Response-header field names can be extended reliably only in
combination with a change in the protocol version. However, new or
experimental header fields MAY be given the semantics of
response-header fields if all parties in the communication recognize
them to be response-header fields. Unrecognized header fields are
treated as entity-header fields.

7 Entity

Request and Response messages MAY transfer an entity if not otherwise
restricted by the request method or response status code. An entity
consists of entity-header fields and an entity-body, although some
responses will only include the entity-headers.

In this section, both sender and recipient refer to either the client
or the server, depending on who sends and who receives the entity.

7.1 Entity Header Fields

Entity-header fields define optional metainformation about the
entity-body or, if no body is present, about the resource identified
by the request.

entity-header  = Allow                    ; Section 14.7
               | Content-Base             ; Section 14.11
               | Content-Encoding         ; Section 14.12
               | Content-Language         ; Section 14.13
               | Content-Length           ; Section 14.14
               | Content-Location         ; Section 14.15
               | Content-MD5              ; Section 14.16
               | Content-Range            ; Section 14.17
               | Content-Type             ; Section 14.18
               | ETag                     ; Section 14.20
               | Expires                  ; Section 14.21
               | Last-Modified            ; Section 14.29
               | extension-header

extension-header = message-header

The extension-header mechanism allows additional entity-header fields 
to be defined without changing the protocol, but these fields cannot 
be assumed to be recognizable by the recipient. Unrecognized header 
fields SHOULD be ignored by the recipient and forwarded by proxies. 

7.2 Entity Body 

The entity-body (if any) sent with an HTTP request or response is in 
a format and encoding defined by the entity-header fields. 

entity-body = *OCTET

An entity-body is only present in a message when a message-body is 
present, as described in section 4.3. The entity-body is obtained 
from the message-body by decoding any Transfer-Encoding that may have 
been applied to ensure safe and proper transfer of the message. 



When an entity-body is included with a message, the data type of that 
body is determined via the header fields Content-Type and Content- 
Encoding. These define a two-layer, ordered encoding model: 

entity-body := Content-Encoding( Content-Type( data ) )

Content-Type specifies the media type of the underlying data. 
Content-Encoding may be used to indicate any additional content
codings applied to the data, usually for the purpose of data 
compression, that are a property of the requested resource. There is 
no default encoding. 
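
On the receiving side the two layers are undone in reverse order: the
content coding is removed first, then the data is interpreted according
to its media type. The sketch below assumes a text media type and
handles only the "gzip" and "x-gzip" content codings; the function name
and parameters are hypothetical.

```python
import gzip


def decode_entity_body(entity_body, content_encoding=None, charset="iso-8859-1"):
    """Hypothetical helper: recover text from an encoded entity-body."""
    data = entity_body
    # Outer layer: undo the content coding, if one was applied.
    if content_encoding is not None:
        if content_encoding.lower() in ("gzip", "x-gzip"):
            data = gzip.decompress(data)
        else:
            raise ValueError("unsupported content coding: %r" % (content_encoding,))
    # Inner layer: interpret the data per its media type (text assumed here).
    return data.decode(charset)
```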



7.2.1 Type

Any HTTP/1.1 message containing an entity-body SHOULD include a
Content-Type header field defining the media type of that body. If
and only if the media type is not given by a Content-Type field, the
recipient MAY attempt to guess the media type via inspection of its
content and/or the name extension(s) of the URL used to identify the
resource. If the media type remains unknown, the recipient SHOULD
treat it as type "application/octet-stream".
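
A minimal Python sketch of this fallback order is shown below. The function name effective_media_type is hypothetical, headers are assumed to be a dict with exact-case keys, and guessing is limited to the URL's name extension via the standard mimetypes module (content inspection is not attempted).

```python
import mimetypes


def effective_media_type(headers: dict, url: str) -> str:
    """Pick the media type: declared first, then guessed, then octet-stream."""
    declared = headers.get("Content-Type")
    if declared:
        # A declared type always wins; guessing is allowed only without it.
        return declared.split(";")[0].strip().lower()
    guessed, _encoding = mimetypes.guess_type(url)
    return guessed or "application/octet-stream"
```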

7.2.2 Length 

The length of an entity-body is the length of the message-body after 
any transfer codings have been removed. Section 4.4 defines how the 
length of a message-body is determined. 

8 Connections 

8.1 Persistent Connections

8.1.1 Purpose

Prior to persistent connections, a separate TCP connection was 
established to fetch each URL, increasing the load on HTTP servers 
and causing congestion on the Internet. The use of inline images and 
other associated data often requires a client to make multiple 
requests of the same server in a short amount of time. Analyses of 
these performance problems are available [30][27]; analysis and
results from a prototype implementation are in [26]. 

Persistent HTTP connections have a number of advantages: 

o By opening and closing fewer TCP connections, CPU time is saved, 
and memory used for TCP protocol control blocks is also saved. 

o HTTP requests and responses can be pipelined on a connection. 
Pipelining allows a client to make multiple requests without 
waiting for each response, allowing a single TCP connection to be 
used much more efficiently, with much lower elapsed time. 

o Network congestion is reduced by reducing the number of packets 
caused by TCP opens, and by allowing TCP sufficient time to 
determine the congestion state of the network. 

o HTTP can evolve more gracefully, since errors can be reported
without the penalty of closing the TCP connection. Clients using 
future versions of HTTP might optimistically try a new feature, but 
if communicating with an older server, retry with old semantics 
after an error is reported. 

HTTP implementations SHOULD implement persistent connections.

8.1.2 Overall Operation

A significant difference between HTTP/1.1 and earlier versions of 
HTTP is that persistent connections are the default behavior of any 
HTTP connection. That is, unless otherwise indicated, the client may 
assume that the server will maintain a persistent connection. 

Persistent connections provide a mechanism by which a client and a 
server can signal the close of a TCP connection. This signaling takes 
place using the Connection header field. Once a close has been 
signaled, the client MUST NOT send any more requests on that
connection. 

8.1.2.1 Negotiation 

An HTTP/1.1 server MAY assume that an HTTP/1.1 client intends to
maintain a persistent connection unless a Connection header including
the connection-token "close" was sent in the request. If the server
chooses to close the connection immediately after sending the
response, it SHOULD send a Connection header including the
connection-token close.

An HTTP/1.1 client MAY expect a connection to remain open, but would
decide to keep it open based on whether the response from a server
contains a Connection header with the connection-token close. In case
the client does not want to maintain a connection for more than that
request, it SHOULD send a Connection header including the
connection-token close.

If either the client or the server sends the close token in the 
Connection header, that request becomes the last one for the 
connection. 

Clients and servers SHOULD NOT assume that a persistent connection is 
maintained for HTTP versions less than 1.1 unless it is explicitly 
signaled. See section 19.7.1 for more information on backwards 
compatibility with HTTP/1.0 clients. 

In order to remain persistent, all messages on the connection must 
have a self-defined message length (i.e., one not defined by closure 
of the connection), as described in section 4.4. 
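
The negotiation rules above could be expressed roughly as the following Python sketch. The function name connection_persists and the (major, minor) version tuple are assumptions made for illustration. Treating a "Keep-Alive" connection-token as the explicit signal for pre-1.1 peers is also an assumption here, based on the HTTP/1.0 compatibility notes that section 19.7.1 refers to.

```python
def connection_persists(http_version: tuple, headers: dict) -> bool:
    """Decide whether the connection stays open after this message."""
    tokens = {
        token.strip().lower()
        for token in headers.get("Connection", "").split(",")
        if token.strip()
    }
    if "close" in tokens:
        return False  # either side sending "close" makes this the last request
    if http_version >= (1, 1):
        return True  # persistent by default in HTTP/1.1
    # Older versions: persistent only if explicitly signaled (assumed token).
    return "keep-alive" in tokens
```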

8.1.2.2 Pipelining

A client that supports persistent connections MAY "pipeline" its 
requests (i.e., send multiple requests without waiting for each 
response). A server MUST send its responses to those requests in the 
same order that the requests were received.

Clients which assume persistent connections and pipeline immediately
after connection establishment SHOULD be prepared to retry their 
connection if the first pipelined attempt fails. If a client does 
such a retry, it MUST NOT pipeline before it knows the connection is 
persistent. Clients MUST also be prepared to resend their requests if 
the server closes the connection before sending all of the 
corresponding responses. 

8.1.3 Proxy Servers 

It is especially important that proxies correctly implement the 
properties of the Connection header field as specified in 14.2.1. 

The proxy server MUST signal persistent connections separately with 
its clients and the origin servers (or other proxy servers) that it 
connects to. Each persistent connection applies to only one transport 
link. 

A proxy server MUST NOT establish a persistent connection with an 
HTTP/1.0 client. 

8.1.4 Practical Considerations 

Servers will usually have some time-out value beyond which they will 
no longer maintain an inactive connection. Proxy servers might make 
this a higher value since it is likely that the client will be making 
more connections through the same server. The use of persistent 
connections places no requirements on the length of this time-out for 
either the client or the server. 

When a client or server wishes to time-out it SHOULD issue a graceful 
close on the transport connection. Clients and servers SHOULD both 
constantly watch for the other side of the transport close, and 
respond to it as appropriate. If a client or server does not detect 
the other side's close promptly it could cause unnecessary resource 
drain on the network. 

A client, server, or proxy MAY close the transport connection at any 
time. For example, a client MAY have started to send a new request at 
the same time that the server has decided to close the "idle" 
connection. From the server's point of view, the connection is being 
closed while it was idle, but from the client's point of view, a 
request is in progress. 

This means that clients, servers, and proxies MUST be able to recover 
from asynchronous close events. Client software SHOULD reopen the 
transport connection and retransmit the aborted request without user 
interaction so long as the request method is idempotent (see section
9.1.2); other methods MUST NOT be automatically retried, although
user agents MAY offer a human operator the choice of retrying the 
request. 

However, this automatic retry SHOULD NOT be repeated if the second 
request fails. 
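
As a rough illustration of this retry policy, the Python sketch below allows one automatic retry and only for the idempotent methods listed in section 9.1.2. The names IDEMPOTENT_METHODS and may_auto_retry are hypothetical.

```python
# Methods that section 9.1.2 lists as idempotent.
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "PUT", "DELETE"})


def may_auto_retry(method: str, automatic_retries_so_far: int) -> bool:
    """True if the client may resend the aborted request without asking the user."""
    if method.upper() not in IDEMPOTENT_METHODS:
        return False  # e.g. POST: only a human operator may choose to retry
    # The automatic retry is not repeated if the second request also fails.
    return automatic_retries_so_far < 1
```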

Servers SHOULD always respond to at least one request per connection, 
if at all possible. Servers SHOULD NOT close a connection in the 
middle of transmitting a response, unless a network or client failure 
is suspected. 

Clients that use persistent connections SHOULD limit the number of 
simultaneous connections that they maintain to a given server. A 
single-user client SHOULD maintain AT MOST 2 connections with any 
server or proxy. A proxy SHOULD use up to 2*N connections to another 
server or proxy, where N is the number of simultaneously active 
users. These guidelines are intended to improve HTTP response times 
and avoid congestion of the Internet or other networks. 

8.2 Message Transmission Requirements 

General requirements: 

o HTTP/1.1 servers SHOULD maintain persistent connections and use 
TCP's flow control mechanisms to resolve temporary overloads, 
rather than terminating connections with the expectation that 
clients will retry. The latter technique can exacerbate network 
congestion. 

o An HTTP/1.1 (or later) client sending a message-body SHOULD monitor 
the network connection for an error status while it is transmitting 
the request. If the client sees an error status, it SHOULD 
immediately cease transmitting the body. If the body is being sent 
using a "chunked" encoding (section 3.6), a zero length chunk and 
empty footer MAY be used to prematurely mark the end of the 
message. If the body was preceded by a Content-Length header, the
client MUST close the connection. 

o An HTTP/1.1 (or later) client MUST be prepared to accept a 100 
(Continue) status followed by a regular response. 

o An HTTP/1.1 (or later) server that receives a request from an
HTTP/1.0 (or earlier) client MUST NOT transmit the 100 (Continue)
response; it SHOULD either wait for the request to be completed 
normally (thus avoiding an interrupted request) or close the 
connection prematurely.

Upon receiving a method subject to these requirements from an
HTTP/1.1 (or later) client, an HTTP/1.1 (or later) server MUST either
respond with 100 (Continue) status and continue to read from the 
input stream, or respond with an error status. If it responds with an 
error status, it MAY close the transport (TCP) connection or it MAY 
continue to read and discard the rest of the request. It MUST NOT 
perform the requested method if it returns an error status. 

Clients SHOULD remember the version number of at least the most 
recently used server; if an HTTP/1.1 client has seen an HTTP/1.1 or 
later response from the server, and it sees the connection close 
before receiving any status from the server, the client SHOULD retry 
the request without user interaction so long as the request method is 
idempotent (see section 9.1.2); other methods MUST NOT be 
automatically retried, although user agents MAY offer a human 
operator the choice of retrying the request. If the client does
retry the request, the client 

o MUST first send the request header fields, and then 

o MUST wait for the server to respond with either a 100 (Continue) 
response, in which case the client should continue, or with an 
error status. 

If an HTTP/1.1 client has not seen an HTTP/1.1 or later response from 
the server, it should assume that the server implements HTTP/1.0 or 
older and will not use the 100 (Continue) response. If in this case 
the client sees the connection close before receiving any status from 
the server, the client SHOULD retry the request. If the client does 
retry the request to this HTTP/1.0 server, it should use the 
following "binary exponential backoff" algorithm to be assured of
obtaining a reliable response: 

1. Initiate a new connection to the server 

2. Transmit the request-headers 

3. Initialize a variable R to the estimated round-trip time to the 
server (e.g., based on the time it took to establish the 
connection), or to a constant value of 5 seconds if the round-trip
time is not available.

4. Compute T = R * (2**N), where N is the number of previous retries 
of this request. 

5. Wait either for an error response from the server, or for T seconds 
(whichever comes first)

6. If no error response is received, after T seconds transmit the body
of the request. 

7. If client sees that the connection is closed prematurely, repeat 
from step 1 until the request is accepted, an error response is 
received, or the user becomes impatient and terminates the retry 
process. 
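
The wait time computed in steps 3 and 4 can be sketched in Python as follows. The function name backoff_wait_seconds and its parameters are hypothetical; the 5-second default and the formula T = R * (2**N) come from the steps above.

```python
def backoff_wait_seconds(previous_retries: int, round_trip=None) -> float:
    """Return T = R * (2**N) for the N-th retry of a request."""
    # Step 3: R is the estimated round-trip time, or 5 seconds if unknown.
    r = 5.0 if round_trip is None else float(round_trip)
    # Step 4: the wait doubles with every previous retry.
    return r * (2 ** previous_retries)


# Waits for the first four attempts when no round-trip estimate is available.
schedule = [backoff_wait_seconds(n) for n in range(4)]
```

With no estimate, the client would wait 5, 10, 20 and then 40 seconds before transmitting the body on successive attempts.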

No matter what the server version, if an error status is received, 
the client 

o MUST NOT continue and 

o MUST close the connection if it has not completed sending the 
message. 

An HTTP/1.1 (or later) client that sees the connection close after
receiving a 100 (Continue) but before receiving any other status 
SHOULD retry the request, and need not wait for 100 (Continue) 
response (but MAY do so if this simplifies the implementation). 

9 Method Definitions 

The set of common methods for HTTP/1.1 is defined below. Although 
this set can be expanded, additional methods cannot be assumed to 
share the same semantics for separately extended clients and servers. 

The Host request-header field (section 14.23) MUST accompany all
HTTP/1.1 requests. 

9.1 Safe and Idempotent Methods

9.1.1 Safe Methods 

Implementers should be aware that the software represents the user in 
their interactions over the Internet, and should be careful to allow 
the user to be aware of any actions they may take which may have an 
unexpected significance to themselves or others. 

In particular, the convention has been established that the GET and 
HEAD methods should never have the significance of taking an action 
other than retrieval. These methods should be considered "safe." This 
allows user agents to represent other methods, such as POST, PUT and 
DELETE, in a special way, so that the user is made aware of the fact 
that a possibly unsafe action is being requested. 

Naturally, it is not possible to ensure that the server does not 
generate side-effects as a result of performing a GET request; in
fact, some dynamic resources consider that a feature. The important
distinction here is that the user did not request the side-effects, 
so therefore cannot be held accountable for them. 

9.1.2 Idempotent Methods 

Methods may also have the property of "idempotence" in that (aside 
from error or expiration issues) the side-effects of N > 0 identical
requests are the same as for a single request. The methods GET, HEAD,
PUT and DELETE share this property. 

9.2 OPTIONS 

The OPTIONS method represents a request for information about the 
communication options available on the request/response chain 
identified by the Request-URI. This method allows the client to 
determine the options and/or requirements associated with a resource, 
or the capabilities of a server, without implying a resource action 
or initiating a resource retrieval. 

Unless the server's response is an error, the response MUST NOT 
include entity information other than what can be considered as 
communication options (e.g., Allow is appropriate, but Content-Type 
is not). Responses to this method are not cachable. 

If the Request-URI is an asterisk ("*"), the OPTIONS request is 
intended to apply to the server as a whole. A 200 response SHOULD 
include any header fields which indicate optional features 
implemented by the server (e.g., Public), including any extensions 
not defined by this specification, in addition to any applicable 
general or response-header fields. As described in section 5.1.2, an 
"OPTIONS *" request can be applied through a proxy by specifying the 
destination server in the Request-URI without any path information. 

If the Request-URI is not an asterisk, the OPTIONS request applies 
only to the options that are available when communicating with that 
resource. A 200 response SHOULD include any header fields which 
indicate optional features implemented by the server and applicable 
to that resource (e.g., Allow), including any extensions not defined 
by this specification, in addition to any applicable general or 
response-header fields. If the OPTIONS request passes through a 
proxy, the proxy MUST edit the response to exclude those options 
which apply to a proxy's capabilities and which are known to be
unavailable through that proxy.

9.3 GET

The GET method means retrieve whatever information (in the form of an 
entity) is identified by the Request-URI. If the Request-URI refers
to a data-producing process, it is the produced data which shall be 
returned as the entity in the response and not the source text of the 
process, unless that text happens to be the output of the process. 

The semantics of the GET method change to a "conditional GET" if the 
request message includes an If-Modified-Since, If-Unmodified-Since,
If-Match, If-None-Match, or If-Range header field. A conditional GET 
method requests that the entity be transferred only under the 
circumstances described by the conditional header field(s). The 
conditional GET method is intended to reduce unnecessary network 
usage by allowing cached entities to be refreshed without requiring 
multiple requests or transferring data already held by the client. 
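
For illustration, a client refreshing a cached entity might build its conditional header fields as in the Python sketch below. The function name conditional_get_headers is hypothetical, and which validators a cache actually keeps (entity tag, modification date) is an assumption of this example.

```python
def conditional_get_headers(etag=None, last_modified=None) -> dict:
    """Build the conditional header fields for refreshing a cached entity."""
    headers = {}
    if etag is not None:
        # Transfer the entity only if its current entity tag differs.
        headers["If-None-Match"] = etag
    if last_modified is not None:
        # Transfer the entity only if it changed after this date.
        headers["If-Modified-Since"] = last_modified
    return headers
```

A request carrying neither field is an ordinary, unconditional GET.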

The semantics of the GET method change to a "partial GET" if the 
request message includes a Range header field. A partial GET requests 
that only part of the entity be transferred, as described in section 
14.36. The partial GET method is intended to reduce unnecessary 
network usage by allowing partially-retrieved entities to be 
completed without transferring data already held by the client. 
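
A partial GET for a byte range might build its Range header field as in this Python sketch. The function name range_header is hypothetical, and the "bytes=first-last" syntax is taken from section 14.36, which lies outside this excerpt.

```python
def range_header(first: int, last=None) -> str:
    """Format a single byte range; an open end means 'through the last byte'."""
    if first < 0 or (last is not None and last < first):
        raise ValueError("invalid byte range")
    if last is None:
        return "bytes=%d-" % first
    return "bytes=%d-%d" % (first, last)
```

A client that already holds the first 500 bytes of an entity could resume with range_header(500) instead of fetching the whole entity again.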

The response to a GET request is cachable if and only if it meets the 
requirements for HTTP caching described in section 13. 

9.4 HEAD

The HEAD method is identical to GET except that the server MUST NOT 
return a message-body in the response. The metainformation contained
in the HTTP headers in response to a HEAD request SHOULD be identical
to the information sent in response to a GET request. This method can
be used for obtaining metainformation about the entity implied by the
request without transferring the entity-body itself. This method is 
often used for testing hypertext links for validity, accessibility, 
and recent modification. 

The response to a HEAD request may be cachable in the sense that the 
information contained in the response may be used to update a 
previously cached entity from that resource. If the new field values 
indicate that the cached entity differs from the current entity (as 
would be indicated by a change in Content-Length, Content-MD5, ETag
or Last-Modified), then the cache MUST treat the cache entry as 
stale. 
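
The staleness check described above might look like the following Python sketch. The names VALIDATOR_FIELDS and head_marks_stale are hypothetical, headers are assumed to be dicts with exact-case keys, and a field present on only one side is treated here as "no evidence of change", which is a design choice of this example rather than a rule from the text.

```python
# Fields whose change indicates that the cached entity differs.
VALIDATOR_FIELDS = ("Content-Length", "Content-MD5", "ETag", "Last-Modified")


def head_marks_stale(cached_headers: dict, head_headers: dict) -> bool:
    """True if a HEAD response shows the cached entity is out of date."""
    return any(
        name in cached_headers
        and name in head_headers
        and cached_headers[name] != head_headers[name]
        for name in VALIDATOR_FIELDS
    )
```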

9.5 POST

The POST method is used to request that the destination server accept 
the entity enclosed in the request as a new subordinate of the 
resource identified by the Request-URI in the Request-Line. POST is 
designed to allow a uniform method to cover the following functions: 

o Annotation of existing resources; 

o Posting a message to a bulletin board, newsgroup, mailing list, 
or similar group of articles; 

o Providing a block of data, such as the result of submitting a 
form, to a data-handling process; 

o Extending a database through an append operation. 

The actual function performed by the POST method is determined by the 
server and is usually dependent on the Request-URI. The posted entity 
is subordinate to that URI in the same way that a file is subordinate 
to a directory containing it, a news article is subordinate to a 
newsgroup to which it is posted, or a record is subordinate to a 
database. 

The action performed by the POST method might not result in a 
resource that can be identified by a URI. In this case, either 200 
(OK) or 204 (No Content) is the appropriate response status, 
depending on whether or not the response includes an entity that 
describes the result. 

If a resource has been created on the origin server, the response 
SHOULD be 201 (Created) and contain an entity which describes the 
status of the request and refers to the new resource, and a Location 
header (see section 14.30). 

Responses to this method are not cachable, unless the response 
includes appropriate Cache-Control or Expires header fields. However, 
the 303 (See Other) response can be used to direct the user agent to 
retrieve a cachable resource. 

POST requests must obey the message transmission requirements set out 
in section 8.2.

9.6 PUT

The PUT method requests that the enclosed entity be stored under the 
supplied Request-URI. If the Request-URI refers to an already 
existing resource, the enclosed entity SHOULD be considered as a 
modified version of the one residing on the origin server. If the 
Request-URI does not point to an existing resource, and that URI is 
capable of being defined as a new resource by the requesting user 
agent, the origin server can create the resource with that URI. If a 
new resource is created, the origin server MUST inform the user agent 
via the 201 (Created) response. If an existing resource is modified, 
either the 200 (OK) or 204 (No Content) response codes SHOULD be sent 
to indicate successful completion of the request. If the resource 
could not be created or modified with the Request-URI, an appropriate 
error response SHOULD be given that reflects the nature of the 
problem. The recipient of the entity MUST NOT ignore any Content-* 
(e.g. Content-Range) headers that it does not understand or implement 
and MUST return a 501 (Not Implemented) response in such cases. 
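
The status-code choices in this paragraph can be sketched in Python as below. The function name put_status and the set UNDERSTOOD_CONTENT_HEADERS are hypothetical; a real server would list whichever Content-* fields it actually implements, and error cases other than 501 are left out.

```python
# Content-* fields this hypothetical server understands (an assumption).
UNDERSTOOD_CONTENT_HEADERS = frozenset(
    {"content-type", "content-length", "content-encoding"}
)


def put_status(resource_existed: bool, request_headers: dict,
               returns_entity: bool = False) -> int:
    """Pick the response status for a PUT that could otherwise be stored."""
    for name in request_headers:
        lowered = name.lower()
        if lowered.startswith("content-") and lowered not in UNDERSTOOD_CONTENT_HEADERS:
            return 501  # Not Implemented: unknown Content-* must not be ignored
    if not resource_existed:
        return 201  # Created
    return 200 if returns_entity else 204  # OK or No Content
```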

If the request passes through a cache and the Request-URI identifies 
one or more currently cached entities, those entries should be 
treated as stale. Responses to this method are not cachable. 

The fundamental difference between the POST and PUT requests is 
reflected in the different meaning of the Request-URI. The URI in a 
POST request identifies the resource that will handle the enclosed 
entity. That resource may be a data-accepting process, a gateway to 
some other protocol, or a separate entity that accepts annotations. 
In contrast, the URI in a PUT request identifies the entity enclosed 
with the request -- the user agent knows what URI is intended and the
server MUST NOT attempt to apply the request to some other resource. 
If the server desires that the request be applied to a different URI, 
it MUST send a 301 (Moved Permanently) response; the user agent MAY 
then make its own decision regarding whether or not to redirect the 
request. 

A single resource MAY be identified by many different URIs. For 
example, an article may have a URI for identifying "the current 
version" which is separate from the URI identifying each particular 
version. In this case, a PUT request on a general URI may result in 
several other URIs being defined by the origin server. 

HTTP/1.1 does not define how a PUT method affects the state of an 
origin server. 

PUT requests must obey the message transmission requirements set out 
in section 8.2.

9.7 DELETE

The DELETE method requests that the origin server delete the resource 
identified by the Request-URI. This method MAY be overridden by human 
intervention (or other means) on the origin server. The client cannot 
be guaranteed that the operation has been carried out, even if the 
status code returned from the origin server indicates that the action 
has been completed successfully. However, the server SHOULD NOT
indicate success unless, at the time the response is given, it 
intends to delete the resource or move it to an inaccessible 
location. 

A successful response SHOULD be 200 (OK) if the response includes an 
entity describing the status, 202 (Accepted) if the action has not 
yet been enacted, or 204 (No Content) if the response is OK but does 
not include an entity. 

If the request passes through a cache and the Request-URI identifies 
one or more currently cached entities, those entries should be 
treated as stale. Responses to this method are not cachable. 

9.8 TRACE 

The TRACE method is used to invoke a remote, application-layer
loop-back of the request message. The final recipient of the request
SHOULD reflect the message received back to the client as the
entity-body of a 200 (OK) response. The final recipient is either the
origin server or the first proxy or gateway to receive a Max-Forwards 
value of zero (0) in the request (see section 14.31). A TRACE request 
MUST NOT include an entity. 

TRACE allows the client to see what is being received at the other 
end of the request chain and use that data for testing or diagnostic 
information. The value of the Via header field (section 14.44) is of 
particular interest, since it acts as a trace of the request chain. 
Use of the Max-Forwards header field allows the client to limit the 
length of the request chain, which is useful for testing a chain of 
proxies forwarding messages in an infinite loop. 

If successful, the response SHOULD contain the entire request message 
in the entity-body, with a Content-Type of "message/http". Responses 
to this method MUST NOT be cached. 

10 Status Code Definitions 

Each Status-Code is described below, including a description of which 
method(s) it can follow and any metainformation required in the
response.

10.1 Informational 1xx

This class of status code indicates a provisional response, 
consisting only of the Status-Line and optional headers, and is 
terminated by an empty line. Since HTTP/1.0 did not define any 1xx
status codes, servers MUST NOT send a 1xx response to an HTTP/1.0
client except under experimental conditions. 

10.1.1 100 Continue

The client may continue with its request. This interim response is 
used to inform the client that the initial part of the request has 
been received and has not yet been rejected by the server. The client 
SHOULD continue by sending the remainder of the request or, if the 
request has already been completed, ignore this response. The server 
MUST send a final response after the request has been completed. 

10.1.2 101 Switching Protocols 

The server understands and is willing to comply with the client's 
request, via the Upgrade message header field (section 14.41), for a 
change in the application protocol being used on this connection. The 
server will switch protocols to those defined by the response's 
Upgrade header field immediately after the empty line which 
terminates the 101 response. 

The protocol should only be switched when it is advantageous to do 
so. For example, switching to a newer version of HTTP is 
advantageous over older versions, and switching to a real-time, 
synchronous protocol may be advantageous when delivering resources 
that use such features. 

10.2 Successful 2xx

This class of status code indicates that the client's request was 
successfully received, understood, and accepted. 

10.2.1 200 OK

The request has succeeded. The information returned with the response 
is dependent on the method used in the request, for example: 

GET an entity corresponding to the requested resource is sent in the 
response; 

HEAD the entity-header fields corresponding to the requested resource 
are sent in the response without any message-body;

POST an entity describing or containing the result of the action;

TRACE an entity containing the request message as received by the end 
server. 

10.2.2 201 Created 

The request has been fulfilled and resulted in a new resource being 
created. The newly created resource can be referenced by the URI(s)
returned in the entity of the response, with the most specific URL 
for the resource given by a Location header field. The origin server 
MUST create the resource before returning the 201 status code. If the 
action cannot be carried out immediately, the server should respond 
with 202 (Accepted) response instead. 

10.2.3 202 Accepted

The request has been accepted for processing, but the processing has 
not been completed. The request MAY or MAY NOT eventually be acted 
upon, as it MAY be disallowed when processing actually takes place. 
There is no facility for re-sending a status code from an 
asynchronous operation such as this. 

The 202 response is intentionally non-committal. Its purpose is to 
allow a server to accept a request for some other process (perhaps a 
batch-oriented process that is only run once per day) without 
requiring that the user agent's connection to the server persist 
until the process is completed. The entity returned with this 
response SHOULD include an indication of the request's current status
and either a pointer to a status monitor or some estimate of when the 
user can expect the request to be fulfilled. 

10.2.4 203 Non-Authoritative Information

The returned metainformation in the entity-header is not the
definitive set as available from the origin server, but is gathered 
from a local or a third-party copy. The set presented MAY be a subset 
or superset of the original version. For example, including local 
annotation information about the resource MAY result in a superset of 
the metainformation known by the origin server. Use of this response
code is not required and is only appropriate when the response would 
otherwise be 200 (OK). 

10.2.5 204 No Content 

The server has fulfilled the request but there is no new information 
to send back. If the client is a user agent, it SHOULD NOT change its 
document view from that which caused the request to be sent. This
response is primarily intended to allow input for actions to take
place without causing a change to the user agent's active document
view. The response MAY include new metainformation in the form of
entity-headers, which SHOULD apply to the document currently in the 
user agent's active view. 

The 204 response MUST NOT include a message-body, and thus is always 
terminated by the first empty line after the header fields. 

10.2.6 205 Reset Content 

The server has fulfilled the request and the user agent SHOULD reset 
the document view which caused the request to be sent. This response 
is primarily intended to allow input for actions to take place via 
user input, followed by a clearing of the form in which the input is 
given so that the user can easily initiate another input action. The 
response MUST NOT include an entity. 

10.2.7 206 Partial Content 

The server has fulfilled the partial GET request for the resource. 
The request must have included a Range header field (section 14.36) 
indicating the desired range. The response MUST include either a 
Content-Range header field (section 14.17) indicating the range 
included with this response, or a multipart/byteranges Content-Type
including Content-Range fields for each part. If multipart/byteranges
is not used, the Content-Length header field in the response MUST 
match the actual number of OCTETs transmitted in the message-body. 
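
For a single-range 206 response, the two header fields named above could be derived as in this Python sketch. The function name partial_content_headers is hypothetical, and the "bytes first-last/total" form of Content-Range is taken from section 14.17, which lies outside this excerpt.

```python
def partial_content_headers(first: int, last: int, total: int) -> dict:
    """Header fields for a 206 response carrying bytes first..last inclusive."""
    if not 0 <= first <= last < total:
        raise ValueError("range does not fit the entity")
    return {
        "Content-Range": "bytes %d-%d/%d" % (first, last, total),
        # Must match the number of octets actually sent in the message-body.
        "Content-Length": str(last - first + 1),
    }
```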

A cache that does not support the Range and Content-Range headers 
MUST NOT cache 206 (Partial) responses. 

10.3 Redirection 3xx 

This class of status code indicates that further action needs to be 
taken by the user agent in order to fulfill the request. The action 
required MAY be carried out by the user agent without interaction 
with the user if and only if the method used in the second request is 
GET or HEAD. A user agent SHOULD NOT automatically redirect a request 
more than 5 times, since such redirections usually indicate an 
infinite loop. 
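
A user agent could encode this rule as in the Python sketch below. The names MAX_AUTOMATIC_REDIRECTS and may_follow_automatically are hypothetical; the limit of 5 and the GET/HEAD restriction come from the paragraph above.

```python
MAX_AUTOMATIC_REDIRECTS = 5  # more than this usually indicates a loop


def may_follow_automatically(second_request_method: str, redirects_so_far: int) -> bool:
    """True if the redirect may be followed without asking the user."""
    if second_request_method.upper() not in ("GET", "HEAD"):
        return False  # other methods need confirmation from the user
    return redirects_so_far < MAX_AUTOMATIC_REDIRECTS
```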

10.3.1 300 Multiple Choices

The requested resource corresponds to any one of a set of 
representations, each with its own specific location, and agent- 
driven negotiation information (section 12) is being provided so that 
the user (or user agent) can select a preferred representation and 
redirect its request to that location. 

Unless it was a HEAD request, the response SHOULD include an entity 
containing a list of resource characteristics and location(s) from 
which the user or user agent can choose the one most appropriate. The 
entity format is specified by the media type given in the Content- 
Type header field. Depending upon the format and the capabilities of 
the user agent, selection of the most appropriate choice may be 
performed automatically. However, this specification does not define 
any standard for such automatic selection. 

If the server has a preferred choice of representation, it SHOULD 
include the specific URL for that representation in the Location 
field; user agents MAY use the Location field value for automatic 
redirection. This response is cachable unless indicated otherwise. 

10.3.2 301 Moved Permanently 

The requested resource has been assigned a new permanent URI and any 
future references to this resource SHOULD be done using one of the 
returned URIs. Clients with link editing capabilities SHOULD 
automatically re-link references to the Request-URI to one or more of 
the new references returned by the server, where possible. This 
response is cachable unless indicated otherwise. 

If the new URI is a location, its URL SHOULD be given by the Location 
field in the response. Unless the request method was HEAD, the entity 
of the response SHOULD contain a short hypertext note with a 
hyperlink to the new URI(s). 

If the 301 status code is received in response to a request other 
than GET or HEAD, the user agent MUST NOT automatically redirect the 
request unless it can be confirmed by the user, since this might 
change the conditions under which the request was issued. 

Note: When automatically redirecting a POST request after receiving 
a 301 status code, some existing HTTP/1.0 user agents will 
erroneously change it into a GET request.

10.3.3 302 Moved Temporarily

The requested resource resides temporarily under a different URI.
Since the redirection may be altered on occasion, the client SHOULD 
continue to use the Request-URI for future requests. This response is 
only cachable if indicated by a Cache-Control or Expires header 
field. 

If the new URI is a location, its URL SHOULD be given by the Location 
field in the response. Unless the request method was HEAD, the entity 
of the response SHOULD contain a short hypertext note with a 
hyperlink to the new URI(s). 

If the 302 status code is received in response to a request other 
than GET or HEAD, the user agent MUST NOT automatically redirect the 
request unless it can be confirmed by the user, since this might 
change the conditions under which the request was issued. 

Note: When automatically redirecting a POST request after receiving 
a 302 status code, some existing HTTP/1.0 user agents will 
erroneously change it into a GET request. 
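
The redirection rules in sections 10.3, 10.3.2 and 10.3.3 can be sketched as a small user-agent policy check. This is a minimal sketch, not a prescribed algorithm: the function name may_follow_redirect, the user_confirmed flag and the constant MAX_REDIRECTS are hypothetical names introduced for illustration.

```python
MAX_REDIRECTS = 5  # limit suggested in section 10.3


def may_follow_redirect(method: str, redirects_so_far: int,
                        user_confirmed: bool = False) -> bool:
    """Decide whether a 301 or 302 response may be followed.

    Methods are compared case-sensitively, as HTTP method names are.
    """
    if user_confirmed:
        # The user approved this hop, so it is not an automatic redirect.
        return True
    if redirects_so_far >= MAX_REDIRECTS:
        # A long chain of redirects usually indicates a loop.
        return False
    # Only GET and HEAD may be redirected without asking the user.
    return method in ("GET", "HEAD")
```

A caller would keep its own count of redirects already followed for the current request and pass it in on each hop.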

10.3.4 303 See Other 

The response to the request can be found under a different URI and 
SHOULD be retrieved using a GET method on that resource. This method 
exists primarily to allow the output of a POST-activated script to
redirect the user agent to a selected resource. The new URI is not a 
substitute reference for the originally requested resource. The 303 
response is not cachable, but the response to the second (redirected) 
request MAY be cachable. 

If the new URI is a location, its URL SHOULD be given by the Location 
field in the response. Unless the request method was HEAD, the entity 
of the response SHOULD contain a short hypertext note with a 
hyperlink to the new URI(s). 

10.3.5 304 Not Modified 

If the client has performed a conditional GET request and access is 
allowed, but the document has not been modified, the server SHOULD 
respond with this status code. The response MUST NOT contain a 
message-body.

The response MUST include the following header fields:

o Date

o ETag and/or Content-Location, if the header would have been sent in
a 200 response to the same request

o Expires, Cache-Control, and/or Vary, if the field-value might
differ from that sent in any previous response for the same variant

If the conditional GET used a strong cache validator (see section 
13.3.3), the response SHOULD NOT include other entity-headers. 
Otherwise (i.e., the conditional GET used a weak validator), the 
response MUST NOT include other entity-headers; this prevents 
inconsistencies between cached entity-bodies and updated headers. 

If a 304 response indicates an entity not currently cached, then the 
cache MUST disregard the response and repeat the request without the 
conditional. 

If a cache uses a received 304 response to update a cache entry, the
cache MUST update the entry to reflect any new field values given in 
the response. 

The 304 response MUST NOT include a message-body, and thus is always 
terminated by the first empty line after the header fields. 
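
The two cache rules above (disregard a 304 for an entry that is not cached, otherwise update the entry with the new field values) might look like this in Python. The dictionary layout of the cache and the name apply_not_modified are assumptions made for the sketch.

```python
def apply_not_modified(cache: dict, url: str, headers_304: dict):
    """Merge the header fields of a 304 response into a cache entry.

    Returns the updated entry, or None when nothing is cached for `url`;
    in that case the caller should repeat the request without the
    conditional.
    """
    entry = cache.get(url)
    if entry is None:
        return None
    # Field values given in the 304 replace the stored ones.
    # The stored body is kept, since a 304 never carries one.
    entry["headers"].update(headers_304)
    return entry
```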

10.3.6 305 Use Proxy 

The requested resource MUST be accessed through the proxy given by 
the Location field. The Location field gives the URL of the proxy. 
The recipient is expected to repeat the request via the proxy. 

10.4 Client Error 4xx 

The 4xx class of status code is intended for cases in which the 
client seems to have erred. Except when responding to a HEAD request, 
the server SHOULD include an entity containing an explanation of the 
error situation, and whether it is a temporary or permanent 
condition. These status codes are applicable to any request method. 
User agents SHOULD display any included entity to the user. 

Note: If the client is sending data, a server implementation using 
TCP should be careful to ensure that the client acknowledges 
receipt of the packet(s) containing the response, before the server
closes the input connection. If the client continues sending data
to the server after the close, the server's TCP stack will send a
reset packet to the client, which may erase the client's
unacknowledged input buffers before they can be read and
interpreted by the HTTP application. 

10.4.1 400 Bad Request

The request could not be understood by the server due to malformed 
syntax. The client SHOULD NOT repeat the request without 
modifications. 

10.4.2 401 Unauthorized 

The request requires user authentication. The response MUST include a 
WWW-Authenticate header field (section 14.46) containing a challenge 
applicable to the requested resource. The client MAY repeat the 
request with a suitable Authorization header field (section 14.8). If 
the request already included Authorization credentials, then the 401 
response indicates that authorization has been refused for those 
credentials. If the 401 response contains the same challenge as the 
prior response, and the user agent has already attempted 
authentication at least once, then the user SHOULD be presented the 
entity that was given in the response, since that entity MAY include 
relevant diagnostic information. HTTP access authentication is 
explained in section 11. 

10.4.3 402 Payment Required 

This code is reserved for future use. 

10.4.4 403 Forbidden 

The server understood the request, but is refusing to fulfill it. 
Authorization will not help and the request SHOULD NOT be repeated. 
If the request method was not HEAD and the server wishes to make 
public why the request has not been fulfilled, it SHOULD describe the 
reason for the refusal in the entity. This status code is commonly 
used when the server does not wish to reveal exactly why the request 
has been refused, or when no other response is applicable. 

10.4.5 404 Not Found 

The server has not found anything matching the Request-URI. No
indication is given of whether the condition is temporary or
permanent.

If the server does not wish to make this information available to the
client, the status code 403 (Forbidden) can be used instead. The 410 
(Gone) status code SHOULD be used if the server knows, through some 
internally configurable mechanism, that an old resource is 
permanently unavailable and has no forwarding address. 

10.4.6 405 Method Not Allowed 

The method specified in the Request-Line is not allowed for the 
resource identified by the Request-URI. The response MUST include an 
Allow header containing a list of valid methods for the requested 
resource. 
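
A server-side dispatcher can satisfy the Allow requirement by deriving the header from its own handler table. The sketch below assumes a hypothetical handlers mapping from method name to a callable; neither it nor the function names come from this specification.

```python
def method_not_allowed(allowed_methods):
    """Build a (status, headers) pair for a 405 response."""
    # The Allow header is mandatory on a 405 and lists the valid methods.
    return 405, {"Allow": ", ".join(allowed_methods)}


def dispatch(method, handlers):
    """Call the handler for `method`, or answer 405 naming the valid ones."""
    if method not in handlers:
        return method_not_allowed(sorted(handlers))
    return handlers[method]()
```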

10.4.7 406 Not Acceptable 

The resource identified by the request is only capable of generating 
response entities which have content characteristics not acceptable 
according to the accept headers sent in the request. 

Unless it was a HEAD request, the response SHOULD include an entity 
containing a list of available entity characteristics and location(s)
from which the user or user agent can choose the one most 
appropriate. The entity format is specified by the media type given 
in the Content-Type header field. Depending upon the format and the 
capabilities of the user agent, selection of the most appropriate 
choice may be performed automatically. However, this specification 
does not define any standard for such automatic selection. 

Note: HTTP/1.1 servers are allowed to return responses which are 
not acceptable according to the accept headers sent in the request. 
In some cases, this may even be preferable to sending a 406 
response. User agents are encouraged to inspect the headers of an 
incoming response to determine if it is acceptable. If the response 
could be unacceptable, a user agent SHOULD temporarily stop receipt 
of more data and query the user for a decision on further actions. 

10.4.8 407 Proxy Authentication Required 

This code is similar to 401 (Unauthorized), but indicates that the 
client MUST first authenticate itself with the proxy. The proxy MUST 
return a Proxy-Authenticate header field (section 14.33) containing a 
challenge applicable to the proxy for the requested resource. The 
client MAY repeat the request with a suitable Proxy-Authorization
header field (section 14.34). HTTP access authentication is explained
in section 11.

10.4.9 408 Request Timeout

The client did not produce a request within the time that the server 
was prepared to wait. The client MAY repeat the request without 
modifications at any later time. 

10.4.10 409 Conflict

The request could not be completed due to a conflict with the current 
state of the resource. This code is only allowed in situations where 
it is expected that the user might be able to resolve the conflict 
and resubmit the request. The response body SHOULD include enough 
information for the user to recognize the source of the conflict. 
Ideally, the response entity would include enough information for the 
user or user agent to fix the problem; however, that may not be 
possible and is not required. 

Conflicts are most likely to occur in response to a PUT request. If 
versioning is being used and the entity being PUT includes changes to 
a resource which conflict with those made by an earlier (third-party) 
request, the server MAY use the 409 response to indicate that it 
can't complete the request. In this case, the response entity SHOULD 
contain a list of the differences between the two versions in a 
format defined by the response Content-Type. 

10.4.11 410 Gone

The requested resource is no longer available at the server and no 
forwarding address is known. This condition SHOULD be considered 
permanent. Clients with link editing capabilities SHOULD delete 
references to the Request-URI after user approval. If the server does 
not know, or has no facility to determine, whether or not the 
condition is permanent, the status code 404 (Not Found) SHOULD be 
used instead. This response is cachable unless indicated otherwise. 

The 410 response is primarily intended to assist the task of web 
maintenance by notifying the recipient that the resource is 
intentionally unavailable and that the server owners desire that 
remote links to that resource be removed. Such an event is common for 
limited-time, promotional services and for resources belonging to 
individuals no longer working at the server's site. It is not 
necessary to mark all permanently unavailable resources as "gone" or 
to keep the mark for any length of time -- that is left to the
discretion of the server owner.

10.4.12 411 Length Required

The server refuses to accept the request without a defined Content- 
Length. The client MAY repeat the request if it adds a valid 
Content-Length header field containing the length of the message-body 
in the request message. 

10.4.13 412 Precondition Failed 

The precondition given in one or more of the request-header fields
evaluated to false when it was tested on the server. This response
code allows the client to place preconditions on the current resource
metainformation (header field data) and thus prevent the requested
method from being applied to a resource other than the one intended. 

10.4.14 413 Request Entity Too Large 

The server is refusing to process a request because the request 
entity is larger than the server is willing or able to process. The 
server may close the connection to prevent the client from continuing 
the request. 

If the condition is temporary, the server SHOULD include a Retry- 
After header field to indicate that it is temporary and after what 
time the client may try again. 

10.4.15 414 Request-URI Too Long 

The server is refusing to service the request because the Request-URI 
is longer than the server is willing to interpret. This rare 
condition is only likely to occur when a client has improperly 
converted a POST request to a GET request with long query 
information, when the client has descended into a URL "black hole" of 
redirection (e.g., a redirected URL prefix that points to a suffix of
itself), or when the server is under attack by a client attempting to 
exploit security holes present in some servers using fixed-length 
buffers for reading or manipulating the Request-URI. 

10.4.16 415 Unsupported Media Type 

The server is refusing to service the request because the entity of 
the request is in a format not supported by the requested resource 
for the requested method.

10.5 Server Error 5xx

Response status codes beginning with the digit "5" indicate cases in
which the server is aware that it has erred or is incapable of 
performing the request. Except when responding to a HEAD request, the 
server SHOULD include an entity containing an explanation of the 
error situation, and whether it is a temporary or permanent 
condition. User agents SHOULD display any included entity to the 
user. These response codes are applicable to any request method. 

10.5.1 500 Internal Server Error 

The server encountered an unexpected condition which prevented it 
from fulfilling the request. 

10.5.2 501 Not Implemented 

The server does not support the functionality required to fulfill the 
request. This is the appropriate response when the server does not 
recognize the request method and is not capable of supporting it for 
any resource. 

10.5.3 502 Bad Gateway 

The server, while acting as a gateway or proxy, received an invalid 
response from the upstream server it accessed in attempting to 
fulfill the request. 

10.5.4 503 Service Unavailable 

The server is currently unable to handle the request due to a 
temporary overloading or maintenance of the server. The implication 
is that this is a temporary condition which will be alleviated after 
some delay. If known, the length of the delay may be indicated in a 
Retry-After header. If no Retry-After is given, the client SHOULD 
handle the response as it would for a 500 response. 

Note: The existence of the 503 status code does not imply that a 
server must use it when becoming overloaded. Some servers may wish 
to simply refuse the connection. 
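
A client might turn the 503 guidance above into a small helper that reports how long to wait. The sketch assumes that Retry-After carries either a number of seconds or an HTTP-date (the field is defined elsewhere in this specification, not in this section), and the name retry_delay is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_delay(headers: dict, now: datetime):
    """Return the number of seconds to wait after a 503 response.

    Returns None when no Retry-After is present, meaning the response
    should be handled like a 500. `now` must be timezone-aware.
    """
    value = headers.get("Retry-After")
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        # Delay given directly as a number of seconds.
        return int(value)
    # Otherwise treat the value as a date and measure the distance from now.
    when = parsedate_to_datetime(value)
    return max(0, int((when - now).total_seconds()))
```

Passing the current time in as an argument, instead of reading the clock inside the function, keeps the helper easy to test.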

10.5.5 504 Gateway Timeout 

The server, while acting as a gateway or proxy, did not receive a 
timely response from the upstream server it accessed in attempting to 
complete the request.

10.5.6 505 HTTP Version Not Supported

The server does not support, or refuses to support, the HTTP protocol 
version that was used in the request message. The server is 
indicating that it is unable or unwilling to complete the request 
using the same major version as the client, as described in section 
3.1, other than with this error message. The response SHOULD contain 
an entity describing why that version is not supported and what other 
protocols are supported by that server. 

11 Access Authentication 

HTTP provides a simple challenge-response authentication mechanism 
which MAY be used by a server to challenge a client request and by a 
client to provide authentication information. It uses an extensible, 
case-insensitive token to identify the authentication scheme, 
followed by a comma-separated list of attribute-value pairs which 
carry the parameters necessary for achieving authentication via that 
scheme. 

auth-scheme = token

auth-param = token "=" quoted-string

The 401 (Unauthorized) response message is used by an origin server 
to challenge the authorization of a user agent. This response MUST 
include a WWW-Authenticate header field containing at least one
challenge applicable to the requested resource. 

challenge = auth-scheme 1*SP realm *( "," auth-param )

realm = "realm" "=" realm-value

realm-value = quoted-string

The realm attribute (case-insensitive) is required for all 
authentication schemes which issue a challenge. The realm value 
(case-sensitive), in combination with the canonical root URL (see 
section 5.1.2) of the server being accessed, defines the protection 
space. These realms allow the protected resources on a server to be 
partitioned into a set of protection spaces, each with its own 
authentication scheme and/or authorization database. The realm value 
is a string, generally assigned by the origin server, which may have 
additional semantics specific to the authentication scheme. 
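
To make the challenge grammar concrete, here is a minimal Python sketch that splits a WWW-Authenticate value into its scheme and parameters. It is a simplification under stated assumptions: it handles one challenge per header value and does not handle escaped quotes inside a quoted-string. The name parse_challenge is hypothetical.

```python
import re

# One auth-param: token "=" quoted-string, followed by a comma or the end.
_PARAM = re.compile(r'\s*([^\s=,]+)\s*=\s*"([^"]*)"\s*(?:,|$)')


def parse_challenge(value: str):
    """Split a challenge into (scheme, params)."""
    scheme, _, rest = value.strip().partition(" ")
    params = {}
    pos = 0
    while pos < len(rest):
        m = _PARAM.match(rest, pos)
        if m is None:
            raise ValueError("malformed auth-param")
        # Attribute names are case-insensitive; values are kept verbatim
        # because the realm value is case-sensitive.
        params[m.group(1).lower()] = m.group(2)
        pos = m.end()
    if "realm" not in params:
        raise ValueError("challenge lacks the required realm")
    # The scheme token is case-insensitive, so normalize it.
    return scheme.lower(), params
```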

A user agent that wishes to authenticate itself with a server --
usually, but not necessarily, after receiving a 401 or 411 response
-- MAY do so by including an Authorization header field with the
request. The Authorization field value consists of credentials
containing the authentication information of the user agent for the
realm of the resource being requested. 

credentials = basic-credentials
            | auth-scheme #auth-param

The domain over which credentials can be automatically applied by a 
user agent is determined by the protection space. If a prior request 
has been authorized, the same credentials MAY be reused for all other 
requests within that protection space for a period of time determined 
by the authentication scheme, parameters, and/or user preference. 
Unless otherwise defined by the authentication scheme, a single 
protection space cannot extend outside the scope of its server. 
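
One way a user agent could apply this rule is to key stored credentials by protection space, taken here as the pair of root URL and realm. The class below is a sketch under that assumption; CredentialStore and its methods are hypothetical names, and approximating the canonical root URL by lower-cased scheme and authority is a simplification.

```python
from urllib.parse import urlsplit


class CredentialStore:
    """Remember credentials per protection space (root URL, realm)."""

    def __init__(self):
        self._by_space = {}

    @staticmethod
    def _space(url: str, realm: str):
        parts = urlsplit(url)
        # Scheme and host are case-insensitive; the realm is not.
        root = f"{parts.scheme.lower()}://{parts.netloc.lower()}"
        return (root, realm)

    def remember(self, url: str, realm: str, credentials: str) -> None:
        self._by_space[self._space(url, realm)] = credentials

    def lookup(self, url: str, realm: str):
        """Return stored credentials for this space, or None."""
        return self._by_space.get(self._space(url, realm))
```

Because the server name is part of the key, credentials stored for one server are never offered to another, which matches the default scope described above.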

If the server does not wish to accept the credentials sent with a 
request, it SHOULD return a 401 (Unauthorized) response. The response 
MUST include a WWW-Authenticate header field containing the (possibly
new) challenge applicable to the requested resource and an entity 
explaining the refusal. 

The HTTP protocol does not restrict applications to this simple 
challenge-response mechanism for access authentication. Additional 
mechanisms MAY be used, such as encryption at the transport level or 
via message encapsulation, and with additional header fields
specifying authentication information. However, these additional 
mechanisms are not defined by this specification. 

Proxies MUST be completely transparent regarding user agent 
authentication. That is, they MUST forward the WWW-Authenticate and
Authorization headers untouched, and follow the rules found in 
section 14.8. 

HTTP/1.1 allows a client to pass authentication information to and 
from a proxy via the Proxy-Authenticate and Proxy-Authorization
headers. 

11.1 Basic Authentication Scheme 

The "basic" authentication scheme is based on the model that the user 
agent must authenticate itself with a user-ID and a password for each 
realm. The realm value should be considered an opaque string which 
can only be compared for equality with other realms on that server. 
The server will service the request only if it can validate the 
user-ID and password for the protection space of the Request-URI. 
There are no optional authentication parameters.

Upon receipt of an unauthorized request for a URI within the
protection space, the server MAY respond with a challenge like the
following:

WWW-Authenticate: Basic realm="WallyWorld"

where "WallyWorld" is the string assigned by the server to identify
the protection space of the Request-URI.

To receive authorization, the client sends the userid and password,
separated by a single colon (":") character, within a base64 encoded
string in the credentials.

basic-credentials = "Basic" SP basic-cookie

basic-cookie = <base64 [7] encoding of user-pass,
except not limited to 76 char/line>

user-pass = userid ":" password

userid = *<TEXT excluding ":">

password = *TEXT

Userids might be case sensitive. 

If the user agent wishes to send the userid "Aladdin" and password 
"open sesame", it would use the following header field: 

Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ== 

See section 15 for security considerations associated with Basic 
authentication. 
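
The construction of the Authorization value in the example above can be reproduced with a few lines of Python. This is a minimal sketch: the function name basic_credentials is hypothetical, and encoding the user-pass string as ISO-8859-1 is an assumption about how TEXT octets are produced.

```python
import base64


def basic_credentials(userid: str, password: str) -> str:
    """Build the value of an Authorization header for the Basic scheme."""
    if ":" in userid:
        # The grammar excludes ":" from userid, since it is the separator.
        raise ValueError("userid must not contain ':'")
    user_pass = f"{userid}:{password}".encode("latin-1")
    # b64encode emits one unbroken line, so the 76-character limit
    # of the base64 definition does not apply here.
    cookie = base64.b64encode(user_pass).decode("ascii")
    return f"Basic {cookie}"
```

Since base64 is an encoding and not encryption, anyone who observes the header can recover the password.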

11.2 Digest Authentication Scheme 

A digest authentication for HTTP is specified in RFC 2069 [32]. 

12 Content Negotiation 

Most HTTP responses include an entity which contains information for 
interpretation by a human user. Naturally, it is desirable to supply 
the user with the "best available" entity corresponding to the 
request. Unfortunately for servers and caches, not all users have 
the same preferences for what is "best," and not all user agents are 
equally capable of rendering all entity types. For that reason, HTTP 
has provisions for several mechanisms for "content negotiation" --
the process of selecting the best representation for a given response
when there are multiple representations available.

Note: This is not called "format negotiation" because the alternate 
representations may be of the same media type, but use different 
capabilities of that type, be in different languages, etc. 

Any response containing an entity-body MAY be subject to negotiation, 
including error responses. 

There are two kinds of content negotiation which are possible in 
HTTP: server-driven and agent-driven negotiation. These two kinds of 
negotiation are orthogonal and thus may be used separately or in 
combination. One method of combination, referred to as transparent 
negotiation, occurs when a cache uses the agent-driven negotiation 
information provided by the origin server in order to provide 
server-driven negotiation for subsequent requests. 

12.1 Server-driven Negotiation 

If the selection of the best representation for a response is made by 
an algorithm located at the server, it is called server-driven 
negotiation. Selection is based on the available representations of 
the response (the dimensions over which it can vary; e.g. language, 
content-coding, etc.) and the contents of particular header fields in 
the request message or on other information pertaining to the request 
(such as the network address of the client). 

Server-driven negotiation is advantageous when the algorithm for 
selecting from among the available representations is difficult to 
describe to the user agent, or when the server desires to send its 
"best guess" to the client along with the first response (hoping to 
avoid the round-trip delay of a subsequent request if the "best 
guess" is good enough for the user). In order to improve the server's 
guess, the user agent MAY include request header fields (Accept, 
Accept-Language, Accept-Encoding, etc.) which describe its
preferences for such a response. 

Server-driven negotiation has disadvantages: 

1. It is impossible for the server to accurately determine what might be
"best" for any given user, since that would require complete
knowledge of both the capabilities of the user agent and the intended
use for the response (e.g., does the user want to view it on screen
or print it on paper?).

2. Having the user agent describe its capabilities in every request can
be both very inefficient (given that only a small percentage of
responses have multiple representations) and a potential violation of
the user's privacy.

3. It complicates the implementation of an origin server and the 
algorithms for generating responses to a request. 

4. It may limit a public cache's ability to use the same response for
multiple users' requests.

HTTP/1.1 includes the following request-header fields for enabling
server-driven negotiation through description of user agent
capabilities and user preferences: Accept (section 14.1), Accept-
Charset (section 14.2), Accept-Encoding (section 14.3), Accept-
Language (section 14.4), and User-Agent (section 14.42). However, an
origin server is not limited to these dimensions and MAY vary the 
response based on any aspect of the request, including information 
outside the request-header fields or within extension header fields 
not defined by this specification. 

HTTP/1.1 origin servers MUST include an appropriate Vary header field
(section 14.43) in any cachable response based on server-driven
negotiation. The Vary header field describes the dimensions over
which the response might vary (i.e. the dimensions over which the
origin server picks its "best guess" response from multiple
representations).

HTTP/1.1 public caches MUST recognize the Vary header field when it 
is included in a response and obey the requirements described in 
section 13.6 that describes the interactions between caching and 
content negotiation. 
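
As an illustration of server-driven negotiation over a single dimension, the Python sketch below picks a language variant from an Accept-Language value and marks the response with Vary. It is deliberately simplified and rests on assumptions: language tags are matched exactly, "*" is not handled, and the function names and the default argument are hypothetical.

```python
def parse_quality_list(header: str):
    """Parse 'da, en-gb;q=0.8, en;q=0.7' into {'da': 1.0, 'en-gb': 0.8, ...}."""
    prefs = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        q = 1.0  # a missing q parameter means full preference
        for p in params.split(";"):
            key, _, val = p.strip().partition("=")
            if key.strip() == "q" and val:
                q = float(val)
        prefs[name.strip().lower()] = q
    return prefs


def negotiate_language(accept_language, available, default):
    """Return (chosen language, response headers) for the request."""
    prefs = parse_quality_list(accept_language or "")
    best = max(available, key=lambda lang: prefs.get(lang.lower(), 0.0))
    if prefs.get(best.lower(), 0.0) == 0.0:
        # Nothing the client asked for is available: send the best guess.
        best = default
    # Vary tells caches which request header influenced the choice.
    return best, {"Content-Language": best, "Vary": "Accept-Language"}
```

The Vary header is emitted even when the default is chosen, because a different Accept-Language value could still have produced a different response.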

12.2 Agent-driven Negotiation 

With agent-driven negotiation, selection of the best representation 
for a response is performed by the user agent after receiving an 
initial response from the origin server. Selection is based on a list 
of the available representations of the response included within the 
header fields (this specification reserves the field-name Alternates, 
as described in appendix 19.6.2.1) or entity-body of the initial 
response, with each representation identified by its own URI. 
Selection from among the representations may be performed 
automatically (if the user agent is capable of doing so) or manually 
by the user selecting from a generated (possibly hypertext) menu. 

Agent-driven negotiation is advantageous when the response would vary 
over commonly-used dimensions (such as type, language, or encoding), 
when the origin server is unable to determine a user agent's 
capabilities from examining the request, and generally when public 
caches are used to distribute server load and reduce network usage.

Agent-driven negotiation suffers from the disadvantage of needing a
second request to obtain the best alternate representation. This
second request is only efficient when caching is used. In addition,
this specification does not define any mechanism for supporting
automatic selection, though it also does not prevent any such
mechanism from being developed as an extension and used within
HTTP/1.1.

HTTP/1.1 defines the 300 (Multiple Choices) and 406 (Not Acceptable) 
status codes for enabling agent-driven negotiation when the server is 
unwilling or unable to provide a varying response using server-driven 
negotiation. 

12.3 Transparent Negotiation 

Transparent negotiation is a combination of both server-driven and 
agent-driven negotiation. When a cache is supplied with a form of the 
list of available representations of the response (as in agent-driven 
negotiation) and the dimensions of variance are completely understood 
by the cache, then the cache becomes capable of performing server- 
driven negotiation on behalf of the origin server for subsequent 
requests on that resource. 

Transparent negotiation has the advantage of distributing the 
negotiation work that would otherwise be required of the origin 
server and also removing the second request delay of agent-driven 
negotiation when the cache is able to correctly guess the right 
response. 

This specification does not define any mechanism for transparent 
negotiation, though it also does not prevent any such mechanism from 
being developed as an extension and used within HTTP/1.1. An HTTP/1.1
cache performing transparent negotiation MUST include a Vary header
field in the response (defining the dimensions of its variance) if it
is cachable to ensure correct interoperation with all HTTP/1.1
clients. The agent-driven negotiation information supplied by the 
origin server SHOULD be included with the transparently negotiated 
response. 

13 Caching in HTTP 

HTTP is typically used for distributed information systems, where 
performance can be improved by the use of response caches. The 
HTTP/1.1 protocol includes a number of elements intended to make 
caching work as well as possible. Because these elements are 
inextricable from other aspects of the protocol, and because they 
interact with each other, it is useful to describe the basic caching 
design of HTTP separately from the detailed descriptions of methods,
headers, response codes, etc.

Caching would be useless if it did not significantly improve
performance. The goal of caching in HTTP/1.1 is to eliminate the need
to send requests in many cases, and to eliminate the need to send
full responses in many other cases. The former reduces the number of 
network round-trips required for many operations; we use an 
"expiration" mechanism for this purpose (see section 13.2). The 
latter reduces network bandwidth requirements; we use a "validation" 
mechanism for this purpose (see section 13.3). 
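
The difference between the two mechanisms can be sketched as a decision a cache makes for each request. The entry layout (an "expires" time in epoch seconds and an optional "etag"), the action names and the use of If-None-Match as the conditional header are assumptions of this sketch; the actual rules are given in sections 13.2 and 13.3.

```python
def plan_request(entry, now: float):
    """Decide how a cache should satisfy a request.

    `entry` is None or a dict with 'expires' (epoch seconds) and an
    optional 'etag'. Returns (action, extra request headers).
    """
    if entry is None:
        return ("fetch", {})
    if now < entry["expires"]:
        # Expiration: the entry is still fresh, so no request is sent.
        return ("serve_from_cache", {})
    if entry.get("etag"):
        # Validation: ask the server to confirm the stored copy, so that
        # a 304 answer avoids resending the full response.
        return ("revalidate", {"If-None-Match": entry["etag"]})
    return ("fetch", {})
```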

Requirements for performance, availability, and disconnected 
operation require us to be able to relax the goal of semantic 
transparency. The HTTP/1.1 protocol allows origin servers, caches,
and clients to explicitly reduce transparency when necessary. 
However, because non-transparent operation may confuse non-expert 
users, and may be incompatible with certain server applications (such 
as those for ordering merchandise), the protocol requires that 
transparency be relaxed 

o only by an explicit protocol-level request when relaxed by client
or origin server

o only with an explicit warning to the end user when relaxed by cache
or client

Therefore, the HTTP/1.1 protocol provides these important elements:

1. Protocol features that provide full semantic transparency when this 
is required by all parties. 

2. Protocol features that allow an origin server or user agent to 
explicitly request and control non-transparent operation.

3. Protocol features that allow a cache to attach warnings to 
responses that do not preserve the requested approximation of 
semantic transparency. 

A basic principle is that it must be possible for the clients to 
detect any potential relaxation of semantic transparency. 

Note: The server, cache, or client implementer may be faced with 
design decisions not explicitly discussed in this specification. If 
a decision may affect semantic transparency, the implementer ought 
to err on the side of maintaining transparency unless a careful and 
complete analysis shows significant benefits in breaking 
transparency. 

13.1.1 Cache Correctness

A correct cache MUST respond to a request with the most up-to-date 
response held by the cache that is appropriate to the request (see 
sections 13.2.5, 13.2.6, and 13.12) which meets one of the following 
conditions:

1. It has been checked for equivalence with what the origin server 
would have returned by revalidating the response with the origin 
server (section 13.3); 

2. It is "fresh enough" (see section 13.2). In the default case, this 
means it meets the least restrictive freshness requirement of the 
client, server, and cache (see section 14.9); if the origin server 
so specifies, it is the freshness requirement of the origin server 
alone. 

3. It includes a warning if the freshness demand of the client or the 
origin server is violated (see section 13.1.5 and 14.45). 

4. It is an appropriate 304 (Not Modified), 305 (Proxy Redirect), or 
error (4xx or 5xx) response message. 

If the cache can not communicate with the origin server, then a 
correct cache SHOULD respond as above if the response can be 
correctly served from the cache; if not it MUST return an error or
warning indicating that there was a communication failure.

If a cache receives a response (either an entire response, or a 304 
(Not Modified) response) that it would normally forward to the 
requesting client, and the received response is no longer fresh, the 
cache SHOULD forward it to the requesting client without adding a new 
Warning (but without removing any existing Warning headers). A cache 
SHOULD NOT attempt to revalidate a response simply because that 
response became stale in transit; this might lead to an infinite 
loop. A user agent that receives a stale response without a Warning
MAY display a warning indication to the user. 

13.1.2 Warnings

Whenever a cache returns a response that is neither first-hand nor 
"fresh enough" (in the sense of condition 2 in section 13.1.1), it 
must attach a warning to that effect, using a Warning response- 
header. This warning allows clients to take appropriate action. 

Warnings may be used for other purposes, both cache-related and 
otherwise. The use of a warning, rather than an error status code,
distinguishes these responses from true failures.

Warnings are always cachable, because they never weaken the 
transparency of a response. This means that warnings can be passed to 
HTTP/1.0 caches without danger; such caches will simply pass the 
warning along as an entity-header in the response. 

Warnings are assigned numbers between 0 and 99. This specification 
defines the code numbers and meanings of each currently assigned
warning, allowing a client or cache to take automated action in some
(but not all) cases. 

Warnings also carry a warning text. The text may be in any
appropriate natural language (perhaps based on the client's Accept
headers), and may include an optional indication of what character
set is used.

Multiple warnings may be attached to a response (either by the origin 
server or by a cache), including multiple warnings with the same code 
number. For example, a server may provide the same warning with texts 
in both English and Basque. 

When multiple warnings are attached to a response, it may not be 
practical or reasonable to display all of them to the user. This 
version of HTTP does not specify strict priority rules for deciding 
which warnings to display and in what order, but does suggest some 
heuristics.

The Warning header and the currently defined warnings are described
in section 14.45. 

13.1.3 Cache-control Mechanisms 

The basic cache mechanisms in HTTP/1.1 (server-specified expiration 
times and validators) are implicit directives to caches. In some 
cases, a server or client may need to provide explicit directives to 
the HTTP caches. We use the Cache-Control header for this purpose. 

The Cache-Control header allows a client or server to transmit a 
variety of directives in either requests or responses. These 
directives typically override the default caching algorithms. As a 
general rule, if there is any apparent conflict between header 
values, the most restrictive interpretation should be applied (that 
is, the one that is most likely to preserve semantic transparency). 
However, in some cases, Cache-Control directives are explicitly 
specified as weakening the approximation of semantic transparency 
(for example, "max-stale" or "public"). 

The Cache-Control directives are described in detail in section 14.9. 
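
As an illustration only, the following Python sketch splits a
Cache-Control header value into its directives and picks the smallest
max-age among several values, which is one possible reading of "most
restrictive interpretation." The names parse_cache_control and
most_restrictive_max_age are hypothetical, and the simple comma split is
an assumption that ignores commas inside quoted strings.

```python
def parse_cache_control(header_value):
    """Split a Cache-Control header value into a {directive: value} dict.

    Directive names are lower-cased; directives without a value map to None.
    Surrounding double quotes are removed from values. Assumption: no commas
    appear inside quoted strings.
    """
    directives = {}
    for part in header_value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, value = part.partition("=")
        directives[name.strip().lower()] = value.strip().strip('"') if sep else None
    return directives


def most_restrictive_max_age(*header_values):
    """Return the smallest max-age (in seconds) found, or None if absent."""
    ages = []
    for header_value in header_values:
        max_age = parse_cache_control(header_value).get("max-age")
        if max_age is not None and max_age.isdigit():
            ages.append(int(max_age))
    return min(ages) if ages else None
```

Choosing the smallest max-age keeps a response fresh for the shortest
time, so it is the choice most likely to preserve semantic transparency.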

13.1.4 Explicit User Agent Warnings 

Many user agents make it possible for users to override the basic 
caching mechanisms. For example, the user agent may allow the user to 
specify that cached entities (even explicitly stale ones) are never 
validated. Or the user agent might habitually add "Cache-Control: 
max-stale=3600" to every request. The user should have to explicitly 
request either non-transparent behavior, or behavior that results in 
abnormally ineffective caching. 

If the user has overridden the basic caching mechanisms, the user 
agent should explicitly indicate to the user whenever this results in 
the display of information that might not meet the server's 
transparency requirements (in particular, if the displayed entity is 
known to be stale). Since the protocol normally allows the user agent 
to determine if responses are stale or not, this indication need only 
be displayed when this actually happens. The indication need not be a 
dialog box; it could be an icon (for example, a picture of a rotting 
fish) or some other visual indicator. 

If the user has overridden the caching mechanisms in a way that would 
abnormally reduce the effectiveness of caches, the user agent should 
continually display an indication (for example, a picture of currency 
in flames) so that the user does not inadvertently consume excess 
resources or suffer from excessive latency.

13.1.5 Exceptions to the Rules and Warnings

In some cases, the operator of a cache may choose to configure it to 
return stale responses even when not requested by clients. This 
decision should not be made lightly, but may be necessary for reasons 
of availability or performance, especially when the cache is poorly 
connected to the origin server. Whenever a cache returns a stale 
response, it MUST mark it as such (using a Warning header). This 
allows the client software to alert the user that there may be a 
potential problem. 

It also allows the user agent to take steps to obtain a first-hand or 
fresh response. For this reason, a cache SHOULD NOT return a stale 
response if the client explicitly requests a first-hand or fresh one, 
unless it is impossible to comply for technical or policy reasons. 
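
A minimal Python sketch of marking a stale response follows. It assumes
that headers are held in a plain dict, that a Warning value consists of a
code, the name of the agent adding it, and a quoted text, and that code 10
with the text "Response is stale" is the applicable warning; these
details belong to section 14.45 and are taken as assumptions here. The
name mark_stale and the agent name used in the usage note are
hypothetical.

```python
def mark_stale(headers, agent, code=10, text="Response is stale"):
    """Return a copy of headers with one more Warning value attached.

    Existing Warning values are kept; the new value is appended after a
    comma, so a response can carry several warnings.
    """
    marked = dict(headers)
    warning = f'{code} {agent} "{text}"'
    existing = marked.get("Warning")
    marked["Warning"] = f"{existing}, {warning}" if existing else warning
    return marked
```

For example, mark_stale({"Date": "..."}, "cache.example") adds a Warning
entry and leaves the dict that was passed in unchanged.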

13.1.6 Client-controlled Behavior 

While the origin server (and to a lesser extent, intermediate caches, 
by their contribution to the age of a response) are the primary 
source of expiration information, in some cases the client may need 
to control a cache's decision about whether to return a cached 
response without validating it. Clients do this using several 
directives of the Cache-Control header. 

A client's request may specify the maximum age it is willing to 
accept of an unvalidated response; specifying a value of zero forces 
the cache(s) to revalidate all responses. A client may also specify 
the minimum time remaining before a response expires. Both of these 
options increase constraints on the behavior of caches, and so cannot 
further relax the cache's approximation of semantic transparency. 

A client may also specify that it will accept stale responses, up to 
some maximum amount of staleness. This loosens the constraints on the 
caches, and so may violate the origin server's specified constraints 
on semantic transparency, but may be necessary to support 
disconnected operation, or high availability in the face of poor 
connectivity. 
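
The three client controls described above can be sketched in Python as a
single decision function. This is not a normative algorithm: the name
cache_may_answer and its parameters are hypothetical, all values are
assumed to be in seconds, and treating a maximum age of zero as "always
revalidate" is our reading of the text.

```python
def cache_may_answer(current_age, freshness_lifetime,
                     max_age=None, min_fresh=None, max_stale=None):
    """Decide whether a cache may return a stored response unvalidated.

    max_age, min_fresh and max_stale stand for the client's request
    directives; None means the client did not send that directive.
    """
    # Maximum acceptable age; zero forces revalidation of every response.
    if max_age is not None and (max_age == 0 or current_age > max_age):
        return False

    # Time left before the response expires (negative once it is stale).
    remaining = freshness_lifetime - current_age

    # Minimum time remaining before expiry that the client insists on.
    if min_fresh is not None and remaining < min_fresh:
        return False

    if remaining > 0:
        return True  # still fresh

    # Stale: acceptable only up to the staleness the client allows.
    staleness = -remaining
    return max_stale is not None and staleness <= max_stale
```

The first two checks can only reject responses, which is why they cannot
relax semantic transparency; only the last branch loosens the rules.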

13.2 Expiration Model 

13.2.1 Server-Specified Expiration 

HTTP caching works best when caches can entirely avoid making 
requests to the origin server. The primary mechanism for avoiding 
requests is for an origin server to provide an explicit expiration 
time in the future, indicating that a response may be used to satisfy 
subsequent requests. In other words, a cache can return a fresh
response without first contacting the server.

Our expectation is that servers will assign future explicit 
expiration times to responses in the belief that the entity is not 
likely to change, in a semantically significant way, before the
expiration time is reached. This normally preserves semantic 
transparency, as long as the server's expiration times are carefully 
chosen. 

The expiration mechanism applies only to responses taken from a cache 
and not to first-hand responses forwarded immediately to the 
requesting client. 

If an origin server wishes to force a semantically transparent cache
to validate every request, it may assign an explicit expiration time 
in the past. This means that the response is always stale, and so the 
cache SHOULD validate it before using it for subsequent requests. See 
section 14.9.4 for a more restrictive way to force revalidation. 

If an origin server wishes to force any HTTP/1.1 cache, no matter how 
it is configured, to validate every request, it should use the
"must-revalidate" Cache-Control directive (see section 14.9).

Servers specify explicit expiration times using either the Expires 
header, or the max-age directive of the Cache-Control header. 

An expiration time cannot be used to force a user agent to refresh 
its display or reload a resource; its semantics apply only to caching 
mechanisms, and such mechanisms need only check a resource's
expiration status when a new request for that resource is initiated. 
See section 13.13 for explanation of the difference between caches 
and history mechanisms. 

13.2.2 Heuristic Expiration 

Since origin servers do not always provide explicit expiration times, 
HTTP caches typically assign heuristic expiration times, employing 
algorithms that use other header values (such as the Last-Modified 
time) to estimate a plausible expiration time. The HTTP/1.1 
specification does not provide specific algorithms, but does impose 
worst-case constraints on their results. Since heuristic expiration 
times may compromise semantic transparency, they should be used 
cautiously, and we encourage origin servers to provide explicit
expiration times as much as possible.

13.2.3 Age Calculations

In order to know if a cached entry is fresh, a cache needs to know if 
its age exceeds its freshness lifetime. We discuss how to calculate 
the latter in section 13.2.4; this section describes how to calculate 
the age of a response or cache entry.

In this discussion, we use the term "now" to mean "the current value 
of the clock at the host performing the calculation." Hosts that use
HTTP, but especially hosts running origin servers and caches, should 
use NTP [28] or some similar protocol to synchronize their clocks to 
a globally accurate time standard. 

Also note that HTTP/1.1 requires origin servers to send a Date header 
with every response, giving the time at which the response was 
generated. We use the term "date_value" to denote the value of the 
Date header, in a form appropriate for arithmetic operations. 

HTTP/1.1 uses the Age response-header to help convey age information 
between caches. The Age header value is the sender's estimate of the 
amount of time since the response was generated at the origin server. 
In the case of a cached response that has been revalidated with the
origin server, the Age value is based on the time of revalidation, 
not of the original response. 

In essence, the Age value is the sum of the time that the response 
has been resident in each of the caches along the path from the 
origin server, plus the amount of time it has been in transit along 
network paths. 

We use the term "age_value" to denote the value of the Age header, in 
a form appropriate for arithmetic operations. 

A response's age can be calculated in two entirely independent ways: 

1. now minus date_value, if the local clock is reasonably well 
synchronized to the origin server's clock. If the result is 
negative, the result is replaced by zero. 

2. age_value, if all of the caches along the response path 
implement HTTP/1.1.

Given that we have two independent ways to compute the age of a 
response when it is received, we can combine these as 

corrected_received_age = max(now - date_value, age_value) 

and as long as we have either nearly synchronized clocks or all-
HTTP/1.1 paths, one gets a reliable (conservative) result.

Note that this correction is applied at each HTTP/1.1 cache along the 
path, so that if there is an HTTP/1.0 cache in the path, the correct 
received age is computed as long as the receiving cache's clock is 
nearly in sync. We don't need end-to-end clock synchronization 
(although it is good to have), and there is no explicit clock 
synchronization step. 

Because of network-imposed delays, some significant interval may pass
between the time that a server generates a response and the time it is
received at the next outbound cache or client. If uncorrected, this 
delay could result in improperly low ages. 

Because the request that resulted in the returned Age value must have 
been initiated prior to that Age value's generation, we can correct 
for delays imposed by the network by recording the time at which the 
request was initiated. Then, when an Age value is received, it MUST 
be interpreted relative to the time the request was initiated, not 
the time that the response was received. This algorithm results in 
conservative behavior no matter how much delay is experienced. So, we 
compute: 

corrected_initial_age = corrected_received_age
                      + (now - request_time)

where "request_time" is the time (according to the local clock) when
the request that elicited this response was sent. 

Summary of age calculation algorithm, when a cache receives a 
response: 

/*
 * age_value
 *      is the value of Age: header received by the cache with
 *      this response.
 * date_value
 *      is the value of the origin server's Date: header
 * request_time
 *      is the (local) time when the cache made the request
 *      that resulted in this cached response
 * response_time
 *      is the (local) time when the cache received the
 *      response
 * now
 *      is the current (local) time
 */

apparent_age = max(0, response_time - date_value);
corrected_received_age = max(apparent_age, age_value);
response_delay = response_time - request_time;
corrected_initial_age = corrected_received_age + response_delay;
resident_time = now - response_time;
current_age = corrected_initial_age + resident_time;

When a cache sends a response, it must add to the 
corrected_initial_age the amount of time that the response was
resident locally. It must then transmit this total age, using the Age 
header, to the next recipient cache. 
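
For readers who want to experiment with the algorithm, the summary above
can be transcribed into Python almost line for line. The sketch assumes
that all five inputs are numbers of seconds on a common scale (for
example, seconds since the epoch); the function name current_age is our
own choice.

```python
def current_age(age_value, date_value, request_time, response_time, now):
    """Compute the current age, in seconds, of a cached response.

    age_value      value of the Age header received with the response
    date_value     value of the origin server's Date header
    request_time   local time when the cache made the request
    response_time  local time when the cache received the response
    now            current local time
    """
    # Age according to the Date header, never negative (guards clock skew).
    apparent_age = max(0, response_time - date_value)
    # Take the more conservative of the two independent estimates.
    corrected_received_age = max(apparent_age, age_value)
    # Charge the whole request/response round trip to the age.
    response_delay = response_time - request_time
    corrected_initial_age = corrected_received_age + response_delay
    # Add the time the response has been held by this cache.
    resident_time = now - response_time
    return corrected_initial_age + resident_time
```

With age_value = 5, date_value = 1000, request_time = 1002,
response_time = 1004 and now = 1010, the apparent age is 4, the corrected
received age is 5, and the current age is 5 + 2 + 6 = 13 seconds.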

Note that a client cannot reliably tell that a response is first- 
hand, but the presence of an Age header indicates that a response 
is definitely not first-hand. Also, if the Date in a response is 
earlier than the client's local request time, the response is 
probably not first-hand (in the absence of serious clock skew). 

13.2.4 Expiration Calculations 

In order to decide whether a response is fresh or stale, we need to 
compare its freshness lifetime to its age. The age is calculated as 
described in section 13.2.3; this section describes how to calculate 
the freshness lifetime, and to determine if a response has expired. 
In the discussion below, the values can be represented in any form 
appropriate for arithmetic operations. 

We use the term "expires_value" to denote the value of the Expires 
header. We use the term "max_age_value" to denote an appropriate 
value of the number of seconds carried by the max-age directive of 
the Cache-Control header in a response (see section 14.9).

The max-age directive takes priority over Expires, so if max-age is 
present in a response, the calculation is simply: 

freshness_lifetime = max_age_value

Otherwise, if Expires is present in the response, the calculation is: 

freshness_lifetime = expires_value - date_value

Note that neither of these calculations is vulnerable to clock skew, 
since all of the information comes from the origin server. 

If neither Expires nor Cache-Control: max-age appears in the 
response, and the response does not include other restrictions on 
caching, the cache MAY compute a freshness lifetime using a 
heuristic. If the value is greater than 24 hours, the cache must 
attach Warning 13 to any response whose age is more than 24 hours if
such warning has not already been added.

Also, if the response does have a Last-Modified time, the heuristic 
expiration value SHOULD be no more than some fraction of the interval 
since that time. A typical setting of this fraction might be 10%. 

The calculation to determine if a response has expired is quite 
simple: 

response_is_fresh = (freshness_lifetime > current_age)
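
The priority order described in this section can be sketched in Python as
follows. The sketch makes several assumptions that the specification
leaves open: all values are integer seconds, the heuristic interval is
measured from the Last-Modified time to the Date value, the fraction is
the "typical" 10%, and a response with no usable information gets a
lifetime of zero. The function names are hypothetical.

```python
def freshness_lifetime(max_age_value=None, expires_value=None,
                       date_value=None, last_modified_value=None,
                       heuristic_percent=10):
    """Return the freshness lifetime of a response, in seconds."""
    # max-age takes priority over Expires.
    if max_age_value is not None:
        return max_age_value
    # Expires is interpreted relative to the origin server's Date.
    if expires_value is not None and date_value is not None:
        return expires_value - date_value
    # Heuristic (assumed form): a fraction of the time since last change.
    if last_modified_value is not None and date_value is not None:
        interval = max(0, date_value - last_modified_value)
        return interval * heuristic_percent // 100
    # Assumption: with nothing to go on, treat the response as stale.
    return 0


def response_is_fresh(lifetime, age):
    """A response is fresh while its lifetime exceeds its current age."""
    return lifetime > age
```

For example, a response last modified 36000 seconds before its Date value
receives a heuristic lifetime of 3600 seconds under these assumptions.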

13.2.5 Disambiguating Expiration Values 

Because expiration values are assigned optimistically, it is possible 
for two caches to contain fresh values for the same resource that are 
different. 

If a client performing a retrieval receives a non-first-hand response
for a request that was already fresh in its own cache, and the Date
header in its existing cache entry is newer than the Date on the new
response, then the client MAY ignore the response. If so, it MAY
retry the request with a "Cache-Control: max-age=0" directive (see 
section 14.9), to force a check with the origin server. 

If a cache has two fresh responses for the same representation with 
different validators, it MUST use the one with the more recent Date 
header. This situation may arise because the cache is pooling 
responses from other caches, or because a client has asked for a 
reload or a revalidation of an apparently fresh cache entry. 

13.2.6 Disambiguating Multiple Responses 

Because a client may be receiving responses via multiple paths, so 
that some responses flow through one set of caches and other 
responses flow through a different set of caches, a client may 
receive responses in an order different from that in which the origin 
server sent them. We would like the client to use the most recently 
generated response, even if older responses are still apparently 
fresh. 

Neither the entity tag nor the expiration value can impose an 
ordering on responses, since it is possible that a later response 
intentionally carries an earlier expiration time. However, the 
HTTP/1.1 specification requires the transmission of Date headers on 
every response, and the Date values are ordered to a granularity of 
one second.

When a client tries to revalidate a cache entry, and the response it
receives contains a Date header that appears to be older than the one 
for the existing entry, then the client SHOULD repeat the request 
unconditionally, and include 

Cache-Control: max-age=0

to force any intermediate caches to validate their copies directly 
with the origin server, or 

Cache-Control: no-cache 

to force any intermediate caches to obtain a new copy from the origin 
server. 

If the Date values are equal, then the client may use either response 
(or may, if it is being extremely prudent, request a new response). 
Servers MUST NOT depend on clients being able to choose 
deterministically between responses generated during the same second,
if their expiration times overlap. 
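
The Date comparison described in this section can be sketched in Python
as below. Date values are assumed to be already parsed into comparable
numbers (seconds); the function names and the returned labels are
hypothetical and serve only to make the three cases explicit.

```python
def revalidation_follow_up(existing_date, new_date):
    """Classify a revalidation response by comparing Date values."""
    if new_date < existing_date:
        # The response appears older than the entry already held:
        # repeat the request unconditionally.
        return "retry"
    if new_date == existing_date:
        # Generated within the same second: either response may be used.
        return "use-either"
    return "use-new"


def retry_headers(want_new_copy=False):
    """Return the Cache-Control header to send with the repeated request."""
    # max-age=0 makes intermediate caches validate their copies;
    # no-cache makes them obtain a new copy from the origin server.
    return {"Cache-Control": "no-cache" if want_new_copy else "max-age=0"}
```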

13.3 Validation Model 

When a cache has a stale entry that it would like to use as a 
response to a client's request, it first has to check with the origin 
server (or possibly an intermediate cache with a fresh response) to 
see if its cached entry is still usable. We call this "validating" 
the cache entry. Since we do not want to have to pay the overhead of 
retransmitting the full response if the cached entry is good, and we 
do not want to pay the overhead of an extra round trip if the cached 
entry is invalid, the HTTP/1.1 protocol supports the use of
conditional methods. 

The key protocol features for supporting conditional methods are 
those concerned with "cache validators." When an origin server
generates a full response, it attaches some sort of validator to it, 
which is kept with the cache entry. When a client (user agent or 
proxy cache) makes a conditional request for a resource for which it 
has a cache entry, it includes the associated validator in the 
request. 

The server then checks that validator against the current validator 
for the entity, and, if they match, it responds with a special status 
code (usually, 304 (Not Modified)) and no entity-body. Otherwise, it 
returns a full response (including entity-body). Thus, we avoid 
transmitting the full response if the validator matches, and we avoid 
an extra round trip if it does not match.

Note: the comparison functions used to decide if validators match
are defined in section 13.3.3. 

In HTTP/1.1, a conditional request looks exactly the same as a normal 
request for the same resource, except that it carries a special 
header (which includes the validator) that implicitly turns the 
method (usually, GET) into a conditional. 

The protocol includes both positive and negative senses of cache- 
validating conditions. That is, it is possible to request either that 
a method be performed if and only if a validator matches or if and 
only if no validators match. 

Note: a response that lacks a validator may still be cached, and 
served from cache until it expires, unless this is explicitly 
prohibited by a Cache-Control directive. However, a cache cannot do 
a conditional retrieval if it does not have a validator for the 
entity, which means it will not be refreshable after it expires. 
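
The exchange described in this section can be sketched from the server's
side in Python. The sketch assumes that the validator is an entity tag
carried in an If-None-Match header and that plain string equality is an
adequate comparison; the function name and the tuple it returns are
hypothetical.

```python
def handle_conditional_get(current_validator, body, if_none_match=None):
    """Return (status, headers, body) for a GET that may be conditional.

    current_validator  the server's current validator for the entity
    body               the full entity-body, as bytes
    if_none_match      the validator sent by the client, or None for a
                       normal (unconditional) request
    """
    headers = {"ETag": current_validator}
    if if_none_match is not None and if_none_match == current_validator:
        # The validator matches: 304 (Not Modified), no entity-body.
        return 304, headers, b""
    # No validator, or no match: full response in the same round trip.
    return 200, headers, body
```

A client holding validator "v2" therefore receives an empty 304 response
while the entity is unchanged, and a complete 200 response as soon as the
server's validator differs.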

13.3.1 Last-modified Dates 

The Last-Modified entity-header field value is often used as a cache 
validator. In simple terms, a cache entry is considered to be valid 
if the entity has not been modified since the Last-Modified value. 

13.3.2 Entity Tag Cache Validators 

The ETag entity-header field value, an entity tag, provides for an 
"opaque" cache validator. This may allow more reliable validation in 
situations where it is inconvenient to store modification dates, 
where the one-second resolution of HTTP date values is not 
sufficient, or where the origin server wishes to avoid certain 
paradoxes that may arise from the use of modification dates. 

Entity Tags are described in section 3.11. The headers used with 
entity tags are described in sections 14.20, 14.25, 14.26 and 14.43. 

13.3.3 Weak and Strong Validators 

Since both origin servers and caches will compare two validators to 
decide if they represent the same or different entities, one normally 
would expect that if the entity (the entity-body or any entity-
headers) changes in any way, then the associated validator would
change as well. If this is true, then we call this validator a 
"strong validator." 

However, there may be cases when a server prefers to change the 
validator only on semantically significant changes, and not when
insignificant aspects of the entity change. A validator that does not
always change when the resource changes is a "weak validator."

Entity tags are normally "strong validators," but the protocol
provides a mechanism to tag an entity tag as "weak." One can think of
a strong validator as one that changes whenever the bits of an entity
change, while a weak validator changes whenever the meaning of an entity
changes. Alternatively, one can think of a strong validator as part 
of an identifier for a specific entity, while a weak validator is 
part of an identifier for a set of semantically equivalent entities. 

Note: One example of a strong validator is an integer that is 
incremented in stable storage every time an entity is changed. 

An entity's modification time, if represented with one-second 
resolution, could be a weak validator, since it is possible that 
the resource may be modified twice during a single second. 

Support for weak validators is optional; however, weak validators 
allow for more efficient caching of equivalent objects; for 
example, a hit counter on a site is probably good enough if it is 
updated every few days or weeks, and any value during that period 
is likely "good enough" to be equivalent. 

A "use" of a validator is either when a client generates a request 
and includes the validator in a validating header field, or when a 
server compares two validators. 

Strong validators are usable in any context. Weak validators are only 
usable in contexts that do not depend on exact equality of an entity. 
For example, either kind is usable for a conditional GET of a full 
entity. However, only a strong validator is usable for a sub-range 
retrieval, since otherwise the client may end up with an internally 
inconsistent entity. 

The only function that the HTTP/1.1 protocol defines on validators is
comparison. There are two validator comparison functions, depending 
on whether the comparison context allows the use of weak validators 
or not: 

o The strong comparison function: in order to be considered equal, 
both validators must be identical in every way, and neither may be 
weak. 

o The weak comparison function: in order to be considered equal, both 
validators must be identical in every way, but either or both of 
them may be tagged as "weak" without affecting the result. 

The weak comparison function MAY be used for simple (non-subrange)
GET requests. The strong comparison function MUST be used in all
other cases. 

An entity tag is strong unless it is explicitly tagged as weak. 
Section 3.11 gives the syntax for entity tags. 
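
As a sketch, the two comparison functions can be written in Python for
entity tags. It assumes the entity tag syntax of section 3.11, in which a
weak tag is marked by the prefix W/ in front of the quoted opaque string;
the function names are hypothetical.

```python
def split_entity_tag(tag):
    """Return (is_weak, opaque_tag) for a tag such as W/"abc" or "abc"."""
    if tag.startswith("W/"):
        return True, tag[2:]
    return False, tag


def strong_compare(a, b):
    """Equal only if the tags are identical and neither is weak."""
    weak_a, opaque_a = split_entity_tag(a)
    weak_b, opaque_b = split_entity_tag(b)
    return not weak_a and not weak_b and opaque_a == opaque_b


def weak_compare(a, b):
    """Equal if the opaque parts are identical; weakness is ignored."""
    return split_entity_tag(a)[1] == split_entity_tag(b)[1]
```

Under these definitions W/"1" and "1" match under the weak comparison but
not under the strong one, and two weak tags never match strongly.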

A Last-Modified time, when used as a validator in a request, is 
implicitly weak unless it is possible to deduce that it is strong, 
using the following rules: 

o The validator is being compared by an origin server to the actual
current validator for the entity, and

o That origin server reliably knows that the associated entity did
not change twice during the second covered by the presented
validator.

or

o The validator is about to be used by a client in an If-Modified-Since
or If-Unmodified-Since header, because the client has a cache
entry for the associated entity, and

o That cache entry includes a Date value, which gives the time when
the origin server sent the original response, and

o The presented Last-Modified time is at least 60 seconds before the
Date value.

or

o The validator is being compared by an intermediate cache to the
validator stored in its cache entry for the entity, and

o That cache entry includes a Date value, which gives the time when
the origin server sent the original response, and

o The presented Last-Modified time is at least 60 seconds before the
Date value.

This method relies on the fact that if two different responses were 
sent by the origin server during the same second, but both had the 
same Last-Modified time, then at least one of those responses would 
have a Date value equal to its Last-Modified time. The arbitrary 60- 
second limit guards against the possibility that the Date and Last- 
Modified values are generated from different clocks, or at somewhat 
different times during the preparation of the response. An 
implementation may use a value larger than 60 seconds, if it is 
believed that 60 seconds is too short. 
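
The 60-second test used by clients and intermediate caches reduces to a
single comparison, sketched here in Python. Both times are assumed to be
numbers of seconds on the same scale; the function name and the
margin_seconds parameter are hypothetical. The origin server case is not
covered, since it depends on knowledge that only the server has.

```python
def last_modified_is_strong(last_modified, date, margin_seconds=60):
    """Return True if a Last-Modified time may be treated as strong.

    last_modified   the presented Last-Modified time
    date            the Date value stored with the cache entry
    margin_seconds  safety margin; may be raised above 60 if 60 seconds
                    is believed to be too short
    """
    return date - last_modified >= margin_seconds
```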

If a client wishes to perform a sub-range retrieval on a value for 
which it has only a Last-Modified time and no opaque validator, it 
may do this only if the Last-Modified time is strong in the sense 
described here.

A cache or origin server receiving a cache-conditional request, other
than a full-body GET request, MUST use the strong comparison function 
to evaluate the condition. 

These rules allow HTTP/1.1 caches and clients to safely perform sub- 
range retrievals on values that have been obtained from HTTP/1.0 
servers. 

13.3.4 Rules for When to Use Entity Tags and Last-modified Dates 

We adopt a set of rules and recommendations for origin servers, 
clients, and caches regarding when various validator types should be 
used, and for what purposes. 

HTTP/1.1 origin servers: 

o SHOULD send an entity tag validator unless it is not feasible to 
generate one. 

o MAY send a weak entity tag instead of a strong entity tag, if
performance considerations support the use of weak entity tags, or
if it is unfeasible to send a strong entity tag.

o SHOULD send a Last-Modified value if it is feasible to send one, 
unless the risk of a breakdown in semantic transparency that could 
result from using this date in an If-Modified-Since header would
lead to serious problems. 

In other words, the preferred behavior for an HTTP/1.1 origin server 
is to send both a strong entity tag and a Last-Modified value. 

In order to be legal, a strong entity tag MUST change whenever the 
associated entity value changes in any way. A weak entity tag SHOULD 
change whenever the associated entity changes in a semantically
significant way. 

Note: in order to provide semantically transparent caching, an
origin server must avoid reusing a specific strong entity tag value 
for two different entities, or reusing a specific weak entity tag 
value for two semantically different entities. Cache entries may 
persist for arbitrarily long periods, regardless of expiration 
times, so it may be inappropriate to expect that a cache will never 
again attempt to validate an entry using a validator that it 
obtained at some point in the past. 

HTTP/1.1 clients:

o If an entity tag has been provided by the origin server, MUST 
use that entity tag in any cache-conditional request (using 
If-Match or If-None-Match).

o If only a Last-Modified value has been provided by the origin
server, SHOULD use that value in non-subrange cache-conditional
requests (using If-Modified-Since).

o If only a Last-Modified value has been provided by an HTTP/1.0
origin server, MAY use that value in subrange cache-conditional
requests (using If-Unmodified-Since:). The user agent should
provide a way to disable this, in case of difficulty. 

o If both an entity tag and a Last-Modified value have been 
provided by the origin server, SHOULD use both validators in 
cache-conditional requests. This allows both HTTP/1.0 and 
HTTP/1.1 caches to respond appropriately.
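
For a simple (non-subrange) conditional GET, the client rules above can
be sketched in Python as follows. The function name is hypothetical, and
the choice of If-None-Match for the entity tag is an assumption suited to
cache validation; If-Match would be used where the positive sense of the
condition is wanted.

```python
def conditional_headers(etag=None, last_modified=None):
    """Build the validating headers for a non-subrange conditional GET.

    etag           entity tag stored with the cache entry, or None
    last_modified  Last-Modified value stored with the entry, or None
    """
    headers = {}
    if etag is not None:
        # An entity tag provided by the origin server is always used.
        headers["If-None-Match"] = etag
    if last_modified is not None:
        # Sent alone or alongside the entity tag, so that HTTP/1.0
        # caches on the path can also respond appropriately.
        headers["If-Modified-Since"] = last_modified
    return headers
```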

An HTTP/1.1 cache, upon receiving a request, MUST use the most 
restrictive validator when deciding whether the client's cache entry 
matches the cache's own cache entry. This is only an issue when the 
request contains both an entity tag and a last-modified-date
validator (If-Modified-Since or If-Unmodified-Since).

A note on rationale: The general principle behind these rules is 
that HTTP/1.1 servers and clients should transmit as much non- 
redundant information as is available in their responses and 
requests. HTTP/1.1 systems receiving this information will make the 
most conservative assumptions about the validators they receive. 

HTTP/1.0 clients and caches will ignore entity tags. Generally, 
last-modified values received or used by these systems will support 
transparent and efficient caching, and so HTTP/1.1 origin servers 
should provide Last-Modified values. In those rare cases where the 
use of a Last-Modified value as a validator by an HTTP/1.0 system 
could result in a serious problem, then HTTP/1.1 origin servers 
should not provide one. 
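
As a rough illustration of the client rules above for a non-subrange
request, the following Python sketch builds the conditional headers
from whatever validators were stored with a cache entry. The function
name and the use of a plain dictionary for headers are assumptions
made for this example only; they are not part of the specification.

```python
def conditional_headers(etag=None, last_modified=None):
    """Return validator headers for a non-subrange cache-conditional GET.

    etag and last_modified are the values received from the origin
    server (or None if the server did not provide them).
    """
    headers = {}
    if etag is not None:
        # An entity tag, when provided, MUST be used.
        headers["If-None-Match"] = etag
    if last_modified is not None:
        # A Last-Modified value SHOULD be sent as well, so that both
        # HTTP/1.0 and HTTP/1.1 caches can respond appropriately.
        headers["If-Modified-Since"] = last_modified
    return headers
```

When neither validator is available the result is empty and the
request is simply unconditional.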

13.3.5 Non-validating Conditionals

The principle behind entity tags is that only the service author 
knows the semantics of a resource well enough to select an 
appropriate cache validation mechanism, and the specification of any 
validator comparison function more complex than byte-equality would 
open up a can of worms. Thus, comparisons of any other headers 
(except Last-Modified, for compatibility with HTTP/1.0) are never
used for purposes of validating a cache entry. 

13.4 Response Cachability 

Unless specifically constrained by a Cache-Control (section 14.9) 
directive, a caching system may always store a successful response 
(see section 13.8) as a cache entry, may return it without validation 
if it is fresh, and may return it after successful validation. If
there is neither a cache validator nor an explicit expiration time
associated with a response, we do not expect it to be cached, but 
certain caches may violate this expectation (for example, when little 
or no network connectivity is available). A client can usually detect 
that such a response was taken from a cache by comparing the Date 
header to the current time. 

Note that some HTTP/1.0 caches are known to violate this 
expectation without providing any Warning. 

However, in some cases it may be inappropriate for a cache to retain 
an entity, or to return it in response to a subsequent request. This 
may be because absolute semantic transparency is deemed necessary by 
the service author, or because of security or privacy considerations. 
Certain Cache-Control directives are therefore provided so that the 
server can indicate that certain resource entities, or portions 
thereof, may not be cached regardless of other considerations. 

Note that section 14.8 normally prevents a shared cache from saving 
and returning a response to a previous request if that request 
included an Authorization header. 

A response received with a status code of 200, 203, 206, 300, 301 or 
410 may be stored by a cache and used in reply to a subsequent 
request, subject to the expiration mechanism, unless a Cache-Control 
directive prohibits caching. However, a cache that does not support 
the Range and Content-Range headers MUST NOT cache 206 (Partial 
Content) responses. 

A response received with any other status code MUST NOT be returned 
in a reply to a subsequent request unless there are Cache-Control 
directives or another header(s) that explicitly allow it. For
example, these include the following: an Expires header (section
14.21); a "max-age", "must-revalidate", "proxy-revalidate", "public"
or "private" Cache-Control directive (section 14.9). 
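
The status-code rules above can be approximated by the following
Python sketch. It is a simplification under stated assumptions:
headers are a dictionary with canonical capitalization, "no-store" is
taken as the only directive that prohibits caching, and the expiration
mechanism is not modeled. The names used are hypothetical.

```python
DEFAULT_CACHABLE = {200, 203, 206, 300, 301, 410}
EXPLICIT_ALLOW = {"max-age", "must-revalidate", "proxy-revalidate",
                  "public", "private"}


def may_reuse(status, headers, supports_range=True):
    """Rough check of whether a stored response may answer a later request."""
    # Directive names only; arguments such as "=60" are dropped.
    directives = {
        d.split("=", 1)[0].strip().lower()
        for d in headers.get("Cache-Control", "").split(",")
        if d.strip()
    }
    if "no-store" in directives:
        return False                  # assumption: the only prohibition modeled
    if status == 206:
        return supports_range         # needs Range and Content-Range support
    if status in DEFAULT_CACHABLE:
        return True
    # Any other status needs a header that explicitly allows reuse.
    return "Expires" in headers or bool(directives & EXPLICIT_ALLOW)
```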

13.5 Constructing Responses From Caches 

The purpose of an HTTP cache is to store information received in 
response to requests, for use in responding to future requests. In 
many cases, a cache simply returns the appropriate parts of a 
response to the requester. However, if the cache holds a cache entry 
based on a previous response, it may have to combine parts of a new 
response with what is held in the cache entry.

13.5.1 End-to-end and Hop-by-hop Headers

For the purpose of defining the behavior of caches and non-caching 
proxies, we divide HTTP headers into two categories: 

o End-to-end headers, which must be transmitted to the 
ultimate recipient of a request or response. End-to-end 
headers in responses must be stored as part of a cache entry 
and transmitted in any response formed from a cache entry. 

o Hop-by-hop headers, which are meaningful only for a single 
transport-level connection, and are not stored by caches or
forwarded by proxies. 

The following HTTP/1.1 headers are hop-by-hop headers: 

o Connection 

o Keep-Alive

o Public 

o Proxy-Authenticate 

o Transfer-Encoding 

o Upgrade 

All other headers defined by HTTP/1.1 are end-to-end headers.

Hop-by-hop headers introduced in future versions of HTTP MUST be 
listed in a Connection header, as described in section 14.10. 
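
A cache or proxy could separate the two categories along the lines of
the Python sketch below. It drops the hop-by-hop headers listed above
together with any header named in a Connection field. The function
name and the dictionary representation of headers are assumptions for
illustration.

```python
HOP_BY_HOP = {"connection", "keep-alive", "public",
              "proxy-authenticate", "transfer-encoding", "upgrade"}


def end_to_end_headers(headers):
    """Return only the headers that may be stored or forwarded."""
    listed = set()
    for name, value in headers.items():
        if name.lower() == "connection":
            # Headers named in Connection are hop-by-hop as well.
            listed |= {t.strip().lower() for t in value.split(",") if t.strip()}
    drop = HOP_BY_HOP | listed
    return {n: v for n, v in headers.items() if n.lower() not in drop}
```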

13.5.2 Non-modifiable Headers 

Some features of the HTTP/1.1 protocol, such as Digest 
Authentication, depend on the value of certain end-to-end headers. A 
cache or non-caching proxy SHOULD NOT modify an end-to-end header 
unless the definition of that header requires or specifically allows 
that. 

A cache or non-caching proxy MUST NOT modify any of the following 
fields in a request or response, nor may it add any of these fields 
if not already present: 

o Content-Location

o ETag 

o Expires 

o Last-Modified

A cache or non-caching proxy MUST NOT modify or add any of the
following fields in a response that contains the no-transform Cache-
Control directive, or in any request:

o Content-Encoding

o Content-Length

o Content-Range

o Content-Type

A cache or non-caching proxy MAY modify or add these fields in a 
response that does not include no-transform, but if it does so, it 
MUST add a Warning 14 (Transformation applied) if one does not 
already appear in the response. 

Warning: unnecessary modification of end-to-end headers may cause 
authentication failures if stronger authentication mechanisms are 
introduced in later versions of HTTP. Such authentication 
mechanisms may rely on the values of header fields not listed here. 

13.5.3 Combining Headers 

When a cache makes a validating request to a server, and the server 
provides a 304 (Not Modified) response, the cache must construct a 
response to send to the requesting client. The cache uses the 
entity-body stored in the cache entry as the entity-body of this 
outgoing response. The end-to-end headers stored in the cache entry 
are used for the constructed response, except that any end-to-end 
headers provided in the 304 response MUST replace the corresponding 
headers from the cache entry. Unless the cache decides to remove the 
cache entry, it MUST also replace the end-to-end headers stored with 
the cache entry with corresponding headers received in the incoming 
response. 

In other words, the set of end-to-end headers received in the 
incoming response overrides all corresponding end-to-end headers 
stored with the cache entry. The cache may add Warning headers (see 
section 14.45) to this set. 

If a header field-name in the incoming response matches more than one 
header in the cache entry, all such old headers are replaced. 
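
The replacement rule can be sketched in Python as follows. Headers are
modeled as lists of (name, value) pairs so that repeated field-names
are preserved; both lists are assumed to hold end-to-end headers only.
The function name is hypothetical.

```python
def merge_304(stored, incoming):
    """Combine stored headers with those received in a 304 response.

    Every stored header whose field-name appears in the incoming
    response is dropped, then the incoming headers are appended.
    """
    replaced = {name.lower() for name, _ in incoming}
    kept = [(n, v) for n, v in stored if n.lower() not in replaced]
    return kept + list(incoming)
```

Because a name absent from the incoming response is left untouched,
this also shows why a 304 response cannot delete a stored header.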

Note: this rule allows an origin server to use a 304 (Not Modified) 
response to update any header associated with a previous response 
for the same entity, although it might not always be meaningful or 
correct to do so. This rule does not allow an origin server to use 
a 304 (Not Modified) response to entirely delete a header that it
had provided with a previous response.

13.5.4 Combining Byte Ranges

A response may transfer only a subrange of the bytes of an entity- 
body, either because the request included one or more Range 
specifications, or because a connection was broken prematurely. After 
several such transfers, a cache may have received several ranges of 
the same entity-body. 

If a cache has a stored non-empty set of subranges for an entity, and 
an incoming response transfers another subrange, the cache MAY 
combine the new subrange with the existing set if both the following 
conditions are met: 

o Both the incoming response and the cache entry must have a cache 
validator. 

o The two cache validators must match using the strong comparison 
function (see section 13.3.3). 

If either requirement is not met, the cache must use only the most
recent partial response (based on the Date values transmitted with 
every response, and using the incoming response if these values are 
equal or missing), and must discard the other partial information. 
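
A minimal Python sketch of the two conditions, assuming that the
validators are entity tags and that a weak tag is written with a "W/"
prefix. Last-Modified values, which section 13.3.3 also admits as
strong validators in some circumstances, are not handled here. The
function name is hypothetical.

```python
def may_combine(stored_etag, incoming_etag):
    """True if two partial responses may be combined into one entry."""
    if stored_etag is None or incoming_etag is None:
        return False                  # both must carry a cache validator
    if stored_etag.startswith("W/") or incoming_etag.startswith("W/"):
        return False                  # weak validators never match strongly
    return stored_etag == incoming_etag
```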

13.6 Caching Negotiated Responses 

Use of server-driven content negotiation (section 12), as indicated 
by the presence of a Vary header field in a response, alters the 
conditions and procedure by which a cache can use the response for 
subsequent requests. 

A server MUST use the Vary header field (section 14.43) to inform a 
cache of what header field dimensions are used to select among 
multiple representations of a cachable response. A cache may use the 
selected representation (the entity included with that particular 
response) for replying to subsequent requests on that resource only 
when the subsequent requests have the same or equivalent values for 
all header fields specified in the Vary response-header. Requests 
with a different value for one or more of those header fields would 
be forwarded toward the origin server. 
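
One way a cache might perform this check is sketched below in Python.
The notion of "equivalent" values is reduced here to equality after
collapsing whitespace, which is an assumption of the example. A Vary
value of "*" is treated as never matching, since it signals that
selection was not limited to request-headers. The function name is
hypothetical.

```python
def vary_matches(vary, stored_request, new_request):
    """True if new_request agrees with stored_request on every Vary field."""
    fields = [f.strip().lower() for f in vary.split(",") if f.strip()]
    if "*" in fields:
        return False                  # must be revalidated with the origin server

    def normalize(request_headers):
        # Case-insensitive names, whitespace-collapsed values.
        return {k.lower(): " ".join(v.split())
                for k, v in request_headers.items()}

    old, new = normalize(stored_request), normalize(new_request)
    return all(old.get(f) == new.get(f) for f in fields)
```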

If an entity tag was assigned to the representation, the forwarded 
request SHOULD be conditional and include the entity tags in an If- 
None-Match header field from all its cache entries for the Request- 
URI. This conveys to the server the set of entities currently held by 
the cache, so that if any one of these entities matches the requested 
entity, the server can use the ETag header in its 304 (Not Modified) 
response to tell the cache which entry is appropriate. If the 
entity-tag of the new response matches that of an existing entry, the
new response SHOULD be used to update the header fields of the
existing entry, and the result MUST be returned to the client. 

The Vary header field may also inform the cache that the 
representation was selected using criteria not limited to the 
request-headers; in this case, a cache MUST NOT use the response in a 
reply to a subsequent request unless the cache relays the new request 
to the origin server in a conditional request and the server responds 
with 304 (Not Modified), including an entity tag or Content-Location 
that indicates which entity should be used. 

If any of the existing cache entries contains only partial content 
for the associated entity, its entity-tag SHOULD NOT be included in 
the If-None-Match header unless the request is for a range that would 
be fully satisfied by that entry. 

If a cache receives a successful response whose Content-Location 
field matches that of an existing cache entry for the same Request- 
URI, whose entity-tag differs from that of the existing entry, and 
whose Date is more recent than that of the existing entry, the 
existing entry SHOULD NOT be returned in response to future requests, 
and should be deleted from the cache. 

13.7 Shared and Non-Shared Caches 

For reasons of security and privacy, it is necessary to make a 
distinction between "shared" and "non-shared" caches. A non-shared 
cache is one that is accessible only to a single user. Accessibility 
in this case SHOULD be enforced by appropriate security mechanisms. 
All other caches are considered to be "shared." Other sections of 
this specification place certain constraints on the operation of 
shared caches in order to prevent loss of privacy or failure of 
access controls. 

13.8 Errors or Incomplete Response Cache Behavior 

A cache that receives an incomplete response (for example, with fewer 
bytes of data than specified in a Content-Length header) may store 
the response. However, the cache MUST treat this as a partial 
response. Partial responses may be combined as described in section 
13.5.4; the result might be a full response or might still be 
partial. A cache MUST NOT return a partial response to a client 
without explicitly marking it as such, using the 206 (Partial 
Content) status code. A cache MUST NOT return a partial response 
using a status code of 200 (OK). 

If a cache receives a 5xx response while attempting to revalidate an 
entry, it may either forward this response to the requesting client,
or act as if the server failed to respond. In the latter case, it MAY
return a previously received response unless the cached entry 
includes the "must-revalidate" Cache-Control directive (see section
14.9). 

13.9 Side Effects of GET and HEAD 

Unless the origin server explicitly prohibits the caching of their 
responses, the application of GET and HEAD methods to any resources 
SHOULD NOT have side effects that would lead to erroneous behavior if 
these responses are taken from a cache. They may still have side 
effects, but a cache is not required to consider such side effects in 
its caching decisions. Caches are always expected to observe an 
origin server's explicit restrictions on caching. 

We note one exception to this rule: since some applications have 
traditionally used GETs and HEADs with query URLs (those containing a
"?" in the rel_path part) to perform operations with significant side
effects, caches MUST NOT treat responses to such URLs as fresh unless
the server provides an explicit expiration time. This specifically
means that responses from HTTP/1.0 servers for such URIs should not
be taken from a cache. See section 9.1.1 for related information. 

13.10 Invalidation After Updates or Deletions 

The effect of certain methods at the origin server may cause one or 
more existing cache entries to become non-transparently invalid. That
is, although they may continue to be "fresh," they do not accurately 
reflect what the origin server would return for a new request. 

There is no way for the HTTP protocol to guarantee that all such 
cache entries are marked invalid. For example, the request that 
caused the change at the origin server may not have gone through the 
proxy where a cache entry is stored. However, several rules help 
reduce the likelihood of erroneous behavior. 

In this section, the phrase "invalidate an entity" means that the 
cache should either remove all instances of that entity from its 
storage, or should mark these as "invalid" and in need of a mandatory 
revalidation before they can be returned in response to a subsequent 
request.

Some HTTP methods may invalidate an entity. This is either the entity
referred to by the Request-URI, or by the Location or Content- 
Location response-headers (if present). These methods are: 

o PUT 
o DELETE 
o POST 

In order to prevent denial of service attacks, an invalidation based 
on the URI in a Location or Content-Location header MUST only be 
performed if the host part is the same as in the Request-URI. 
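
The invalidation rules, including the host check, might look like the
following Python sketch. It assumes absolute URIs; a relative Location
or Content-Location value has no host part here and is therefore
skipped. Comparing only the host name (not the port) is a further
assumption, and the function name is hypothetical.

```python
from urllib.parse import urlsplit

INVALIDATING_METHODS = {"PUT", "DELETE", "POST"}


def uris_to_invalidate(method, request_uri, response_headers):
    """List the URIs whose cache entries should be invalidated."""
    if method.upper() not in INVALIDATING_METHODS:
        return []
    targets = [request_uri]
    request_host = urlsplit(request_uri).hostname
    for name in ("Location", "Content-Location"):
        uri = response_headers.get(name)
        # Only honor the header if it points at the same host, to
        # prevent denial of service through foreign invalidations.
        if uri and urlsplit(uri).hostname == request_host and uri not in targets:
            targets.append(uri)
    return targets
```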

13.11 Write-Through Mandatory 

All methods that may be expected to cause modifications to the origin 
server's resources MUST be written through to the origin server. This 
currently includes all methods except for GET and HEAD. A cache MUST 
NOT reply to such a request from a client before having transmitted 
the request to the inbound server, and having received a 
corresponding response from the inbound server. This does not prevent 
a cache from sending a 100 (Continue) response before the inbound
server has replied. 

The alternative (known as "write-back" or "copy-back" caching) is not
allowed in HTTP/1.1, due to the difficulty of providing consistent 
updates and the problems arising from server, cache, or network 
failure prior to write-back. 

13.12 Cache Replacement 

If a new cachable (see sections 14.9.2, 13.2.5, 13.2.6 and 13.8) 
response is received from a resource while any existing responses for 
the same resource are cached, the cache SHOULD use the new response 
to reply to the current request. It may insert it into cache storage 
and may, if it meets all other requirements, use it to respond to any 
future requests that would previously have caused the old response to 
be returned. If it inserts the new response into cache storage it 
should follow the rules in section 13.5.3. 

Note: a new response that has an older Date header value than 
existing cached responses is not cachable. 

13.13 History Lists

User agents often have history mechanisms, such as "Back" buttons and 
history lists, which can be used to redisplay an entity retrieved 
earlier in a session.

History mechanisms and caches are different. In particular history
mechanisms SHOULD NOT try to show a semantically transparent view of
the current state of a resource. Rather, a history mechanism is meant 
to show exactly what the user saw at the time when the resource was 
retrieved. 

By default, an expiration time does not apply to history mechanisms. 
If the entity is still in storage, a history mechanism should display 
it even if the entity has expired, unless the user has specifically 
configured the agent to refresh expired history documents. 

This should not be construed to prohibit the history mechanism from 
telling the user that a view may be stale. 

Note: if history list mechanisms unnecessarily prevent users from 
viewing stale resources, this will tend to force service authors to 
avoid using HTTP expiration controls and cache controls when they 
would otherwise like to. Service authors may consider it important 
that users not be presented with error messages or warning messages 
when they use navigation controls (such as BACK) to view previously
fetched resources. Even though sometimes such resources ought not
to be cached, or ought to expire quickly, user interface
considerations may force service authors to resort to other means 
of preventing caching (e.g. "once-only" URLs) in order not to 
suffer the effects of improperly functioning history mechanisms. 

14 Header Field Definitions 

This section defines the syntax and semantics of all standard 
HTTP/1.1 header fields. For entity-header fields, both sender and 
recipient refer to either the client or the server, depending on who 
sends and who receives the entity.

14.1 Accept

The Accept request-header field can be used to specify certain media 
types which are acceptable for the response. Accept headers can be 
used to indicate that the request is specifically limited to a small 
set of desired types, as in the case of a request for an in-line 
image. 

Accept         = "Accept" ":"
                 #( media-range [ accept-params ] )

media-range    = ( "*/*"
                 | ( type "/" "*" )
                 | ( type "/" subtype )
                 ) *( ";" parameter )

accept-params  = ";" "q" "=" qvalue *( accept-extension )

accept-extension = ";" token [ "=" ( token | quoted-string ) ]

The asterisk "*" character is used to group media types into ranges, 
with "*/*" indicating all media types and "type/*" indicating all 
subtypes of that type. The media-range MAY include media type 
parameters that are applicable to that range. 

Each media-range MAY be followed by one or more accept-params, 
beginning with the "q" parameter for indicating a relative quality 
factor. The first "q" parameter (if any) separates the media-range 
parameter(s) from the accept-params. Quality factors allow the user 
or user agent to indicate the relative degree of preference for that 
media-range, using the qvalue scale from 0 to 1 (section 3.9). The 
default value is q=1.

Note: Use of the "q" parameter name to separate media type 
parameters from Accept extension parameters is due to historical 
practice. Although this prevents any media type parameter named 
"q" from being used with a media range, such an event is believed 
to be unlikely given the lack of any "q" parameters in the IANA 
media type registry and the rare usage of any media type parameters 
in Accept. Future media types should be discouraged from 
registering any parameter named "q". 

The example 

Accept: audio/*; q=0.2, audio/basic

SHOULD be interpreted as "I prefer audio/basic, but send me any audio
type if it is the best available after an 80% mark-down in quality."

If no Accept header field is present, then it is assumed that the
client accepts all media types. If an Accept header field is present, 
and if the server cannot send a response which is acceptable 
according to the combined Accept field value, then the server SHOULD 
send a 406 (not acceptable) response. 

A more elaborate example is 

Accept: text/plain; q=0.5, text/html,
        text/x-dvi; q=0.8, text/x-c

Verbally, this would be interpreted as "text/html and text/x-c are
the preferred media types, but if they do not exist, then send the
text/x-dvi entity, and if that does not exist, send the text/plain
entity."

Media ranges can be overridden by more specific media ranges or 
specific media types. If more than one media range applies to a given 
type, the most specific reference has precedence. For example, 

Accept: text/*, text/html, text/html;level=1, */*

have the following precedence:

1) text/html;level=1

2) text/html 

3) text/* 

4) */* 

The media type quality factor associated with a given type is 
determined by finding the media range with the highest precedence 
which matches that type. For example, 

Accept: text/*;q=0.3, text/html;q=0.7, text/html;level=1,
        text/html;level=2;q=0.4, */*;q=0.5

would cause the following values to be associated:

text/html;level=1         = 1
text/html                 = 0.7
text/plain                = 0.3
image/jpeg                = 0.5
text/html;level=2         = 0.4
text/html;level=3         = 0.7
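
The precedence rule can be made concrete with a short Python sketch
that finds the most specific media range matching a given type and
returns its quality factor. This is an illustration under simplifying
assumptions: accept-extension parameters after "q" are ignored, quoted
parameter values are not handled, and the function names are
hypothetical.

```python
def parse_accept(value):
    """Parse an Accept field value into (type, subtype, params, q) tuples."""
    ranges = []
    for item in value.split(","):
        parts = [p.strip() for p in item.split(";")]
        if not parts[0]:
            continue
        mtype, _, subtype = parts[0].lower().partition("/")
        params, q = {}, 1.0           # q defaults to 1
        for p in parts[1:]:
            key, _, val = p.partition("=")
            key, val = key.strip().lower(), val.strip()
            if key == "q":
                q = float(val)
                break                 # accept-extensions after "q" are ignored
            params[key] = val
        ranges.append((mtype, subtype, params, q))
    return ranges


def quality(accept, media_type):
    """Return the quality factor the Accept value assigns to media_type."""
    base, *rest = [p.strip() for p in media_type.split(";")]
    mtype, _, subtype = base.lower().partition("/")
    given = {}
    for p in rest:
        key, _, val = p.partition("=")
        given[key.strip().lower()] = val.strip()
    best_rank, best_q = -1, 0.0
    for r_type, r_sub, r_params, q in parse_accept(accept):
        if (r_type, r_sub) == ("*", "*"):
            rank = 0
        elif r_type == mtype and r_sub == "*":
            rank = 1
        elif r_type == mtype and r_sub == subtype:
            if any(given.get(k) != v for k, v in r_params.items()):
                continue              # a parameter of the range does not match
            rank = 2 + len(r_params)  # more parameters means more specific
        else:
            continue
        if rank > best_rank:
            best_rank, best_q = rank, q
    return best_q
```

A type matched by no range receives 0.0 in this sketch, which a server
could treat as not acceptable.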



Note: A user agent may be provided with a default set of quality 
values for certain media ranges. However, unless the user agent is 
a closed system which cannot interact with other rendering agents,
this default set should be configurable by the user.

14.2 Accept-Charset

The Accept-Charset request-header field can be used to indicate what
character sets are acceptable for the response. This field allows 
clients capable of understanding more comprehensive or special- 
purpose character sets to signal that capability to a server which is 
capable of representing documents in those character sets. The ISO- 
8859-1 character set can be assumed to be acceptable to all user 
agents.

Accept-Charset = "Accept-Charset" ":"
                 1#( charset [ ";" "q" "=" qvalue ] )

Character set values are described in section 3.4. Each charset may
be given an associated quality value which represents the user's
preference for that charset. The default value is q=1. An example is

Accept-Charset: iso-8859-5, unicode-1-1;q=0.8

If no Accept-Charset header is present, the default is that any
character set is acceptable. If an Accept-Charset header is present,
and if the server cannot send a response which is acceptable
according to the Accept-Charset header, then the server SHOULD send
an error response with the 406 (not acceptable) status code, though
the sending of an unacceptable response is also allowed.

14.3 Accept-Encoding

The Accept-Encoding request-header field is similar to Accept, but
restricts the content-coding values (section 14.12) which are
acceptable in the response.

Accept-Encoding = "Accept-Encoding" ":"
                  #( content-coding )

An example of its use is

Accept-Encoding: compress, gzip

If no Accept-Encoding header is present in a request, the server MAY
assume that the client will accept any content coding. If an Accept-
Encoding header is present, and if the server cannot send a response
which is acceptable according to the Accept-Encoding header, then the
server SHOULD send an error response with the 406 (Not Acceptable)
status code.

An empty Accept-Encoding value indicates none are acceptable.

14.4 Accept-Language

The Accept-Language request-header field is similar to Accept, but
restricts the set of natural languages that are preferred as a
response to the request.

Accept-Language = "Accept-Language" ":"
                  1#( language-range [ ";" "q" "=" qvalue ] )

language-range  = ( ( 1*8ALPHA *( "-" 1*8ALPHA ) ) | "*" )

Each language-range MAY be given an associated quality value which 
represents an estimate of the user's preference for the languages 
specified by that range. The quality value defaults to "q=1". For
example, 

Accept-Language: da, en-gb;q=0.8, en;q=0.7

would mean: "I prefer Danish, but will accept British English and 
other types of English." A language-range matches a language-tag if
it exactly equals the tag, or if it exactly equals a prefix of the 
tag such that the first tag character following the prefix is "-". 
The special range "*", if present in the Accept-Language field, 
matches every tag not matched by any other range present in the 
Accept-Language field. 

Note: This use of a prefix matching rule does not imply that 
language tags are assigned to languages in such a way that it is 
always true that if a user understands a language with a certain 
tag, then this user will also understand all languages with tags 
for which this tag is a prefix. The prefix rule simply allows the 
use of prefix tags if this is the case. 

The language quality factor assigned to a language-tag by the 
Accept-Language field is the quality value of the longest language- 
range in the field that matches the language-tag. If no language- 
range in the field matches the tag, the language quality factor 
assigned is 0. If no Accept-Language header is present in the 
request, the server SHOULD assume that all languages are equally 
acceptable. If an Accept-Language header is present, then all 
languages which are assigned a quality factor greater than 0 are 
acceptable. 
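
The matching and quality rules above can be sketched in Python as
follows. The function handles the case where an Accept-Language field
is present; the absent-header case (all languages equally acceptable)
is left to the caller. The function name is an assumption made for
this example.

```python
def language_quality(accept_language, tag):
    """Quality factor assigned to a language-tag by an Accept-Language value."""
    tag = tag.lower()
    best_len, best_q, star_q = -1, 0.0, None
    for item in accept_language.split(","):
        rng, _, param = item.strip().partition(";")
        rng = rng.strip().lower()
        if not rng:
            continue
        q = 1.0                       # quality value defaults to q=1
        name, _, val = param.partition("=")
        if name.strip().lower() == "q" and val.strip():
            q = float(val)
        if rng == "*":
            star_q = q                # used only if no other range matches
        elif tag == rng or tag.startswith(rng + "-"):
            if len(rng) > best_len:   # the longest matching range wins
                best_len, best_q = len(rng), q
    if best_len >= 0:
        return best_q
    return star_q if star_q is not None else 0.0
```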

It may be contrary to the privacy expectations of the user to send an 
Accept-Language header with the complete linguistic preferences of 
the user in every request. For a discussion of this issue, see
section 15.7.

Note: As intelligibility is highly dependent on the individual 
user, it is recommended that client applications make the choice of 
linguistic preference available to the user. If the choice is not 
made available, then the Accept-Language header field must not be
given in the request. 

14.5 Accept-Ranges

The Accept-Ranges response-header field allows the server to indicate 
its acceptance of range requests for a resource: 

Accept-Ranges = "Accept-Ranges" ":" acceptable-ranges 

acceptable-ranges = 1#range-unit | "none"

Origin servers that accept byte-range requests MAY send 

Accept-Ranges: bytes 

but are not required to do so. Clients MAY generate byte-range 
requests without having received this header for the resource 
involved. 

Servers that do not accept any kind of range request for a resource 
MAY send 

Accept-Ranges: none 

to advise the client not to attempt a range request. 

14.6 Age

The Age response-header field conveys the sender's estimate of the 
amount of time since the response (or its revalidation) was generated 
at the origin server. A cached response is "fresh" if its age does 
not exceed its freshness lifetime. Age values are calculated as 
specified in section 13.2.3. 

Age = "Age" ":" age-value

age-value = delta-seconds 

Age values are non-negative decimal integers, representing time in 
seconds.

If a cache receives a value larger than the largest positive integer
it can represent, or if any of its age calculations overflows, it
MUST transmit an Age header with a value of 2147483648 (2^31).
HTTP/1.1 caches MUST send an Age header in every response. Caches 
SHOULD use an arithmetic type of at least 31 bits of range. 
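
A small Python sketch of the overflow rule and of the freshness test
mentioned at the start of this section. Python integers do not
overflow, so the clamp below merely imitates what a cache with a
fixed-width arithmetic type would have to do; the names are
hypothetical.

```python
MAX_AGE_VALUE = 2 ** 31               # 2147483648


def age_header_value(age_seconds):
    """Return the age-value to transmit, clamped at 2^31."""
    if age_seconds < 0:
        raise ValueError("Age values are non-negative")
    return str(min(age_seconds, MAX_AGE_VALUE))


def is_fresh(age_seconds, freshness_lifetime):
    """A response is fresh if its age does not exceed its lifetime."""
    return age_seconds <= freshness_lifetime
```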

14.7 Allow 

The Allow entity-header field lists the set of methods supported by 
the resource identified by the Request-URI. The purpose of this field
is strictly to inform the recipient of valid methods associated with 
the resource. An Allow header field MUST be present in a 405 (Method 
Not Allowed) response. 

Allow = "Allow" ":" 1#method

Example of use: 

Allow: GET, HEAD, PUT 

This field cannot prevent a client from trying other methods. 
However, the indications given by the Allow header field value SHOULD 
be followed. The actual set of allowed methods is defined by the 
origin server at the time of each request. 

The Allow header field MAY be provided with a PUT request to 
recommend the methods to be supported by the new or modified 
resource. The server is not required to support these methods and 
SHOULD include an Allow header in the response giving the actual 
supported methods. 

A proxy MUST NOT modify the Allow header field even if it does not 
understand all the methods specified, since the user agent MAY have 
other means of communicating with the origin server. 

The Allow header field does not indicate what methods are implemented 
at the server level. Servers MAY use the Public response-header field 
(section 14.35) to describe what methods are implemented on the 
server as a whole. 

14.8 Authorization 

A user agent that wishes to authenticate itself with a server — 
usually, but not necessarily, after receiving a 401 response — MAY do 
so by including an Authorization request-header field with the
request. The Authorization field value consists of credentials 
containing the authentication information of the user agent for the 
realm of the resource being requested.

Authorization = "Authorization" ":" credentials

HTTP access authentication is described in section 11. If a request 
is authenticated and a realm specified, the same credentials SHOULD 
be valid for all other requests within this realm. 

When a shared cache (see section 13.7) receives a request containing 
an Authorization field, it MUST NOT return the corresponding response 
as a reply to any other request, unless one of the following specific 
exceptions holds: 

1. If the response includes the "proxy-revalidate" Cache-Control
directive, the cache MAY use that response in replying to a 
subsequent request, but a proxy cache MUST first revalidate it with 
the origin server, using the request-headers from the new request 
to allow the origin server to authenticate the new request. 

2. If the response includes the "must-revalidate" Cache-Control
directive, the cache MAY use that response in replying to a 
subsequent request, but all caches MUST first revalidate it with 
the origin server, using the request-headers from the new request
to allow the origin server to authenticate the new request. 

3. If the response includes the "public" Cache-Control directive, it 
may be returned in reply to any subsequent request. 
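
For a shared cache, the three exceptions reduce to a small decision,
sketched here in Python. The return values "reuse", "revalidate" and
"forbid" are labels invented for this example, as is the function
name; the sketch applies only to responses to requests that carried an
Authorization field.

```python
def shared_cache_action(response_cache_control):
    """Decide how a shared cache may use a response to an authorized request."""
    directives = {
        d.split("=", 1)[0].strip().lower()
        for d in response_cache_control.split(",")
        if d.strip()
    }
    if "public" in directives:
        return "reuse"                # may answer any subsequent request
    if "must-revalidate" in directives or "proxy-revalidate" in directives:
        return "revalidate"           # must first check with the origin server
    return "forbid"                   # MUST NOT be returned for other requests
```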

14.9 Cache-Control 

The Cache-Control general-header field is used to specify directives
that MUST be obeyed by all caching mechanisms along the 
request/response chain. The directives specify behavior intended to 
prevent caches from adversely interfering with the request or 
response. These directives typically override the default caching 
algorithms. Cache directives are unidirectional in that the presence 
of a directive in a request does not imply that the same directive 
should be given in the response. 

Note that HTTP/1.0 caches may not implement Cache-Control and may 
only implement Pragma: no-cache (see section 14.32). 

Cache directives must be passed through by a proxy or gateway 
application, regardless of their significance to that application, 
since the directives may be applicable to all recipients along the 
request/response chain. It is not possible to specify a cache- 
directive for a specific cache. 

Cache-Control   = "Cache-Control" ":" 1#cache-directive

cache-directive = cache-request-directive
                | cache-response-directive

cache-request-directive =
                  "no-cache" [ "=" <"> 1#field-name <"> ]
                | "no-store"
                | "max-age" "=" delta-seconds
                | "max-stale" [ "=" delta-seconds ]
                | "min-fresh" "=" delta-seconds
                | "only-if-cached"
                | cache-extension

cache-response-directive =
                  "public"
                | "private" [ "=" <"> 1#field-name <"> ]
                | "no-cache" [ "=" <"> 1#field-name <"> ]
                | "no-store"
                | "no-transform"
                | "must-revalidate"
                | "proxy-revalidate"
                | "max-age" "=" delta-seconds
                | cache-extension

cache-extension = token [ "=" ( token | quoted-string ) ]

When a directive appears without any 1#field-name parameter, the
directive applies to the entire request or response. When such a
directive appears with a 1#field-name parameter, it applies only to
the named field or fields, and not to the rest of the request or
response. This mechanism supports extensibility; implementations of
future versions of the HTTP protocol may apply these directives to
header fields not defined in HTTP/1.1.
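
The grammar above can be illustrated with a small parser. This is a minimal sketch rather than a complete implementation: the function name parse_cache_control is hypothetical, directive names are lowercased on the assumption that they compare case-insensitively, and backslash escapes inside quoted strings are not handled.

```python
def parse_cache_control(value):
    """Split a Cache-Control field value into {directive: argument-or-None}."""
    directives = {}
    item, in_quotes = [], False
    # Walk character by character so that commas inside a quoted argument
    # (e.g. private="Set-Cookie, Authorization") do not split the item.
    # The appended comma flushes the final directive.
    for ch in value + ",":
        if ch == '"':
            in_quotes = not in_quotes
            item.append(ch)
        elif ch == "," and not in_quotes:
            text = "".join(item).strip()
            item = []
            if not text:
                continue  # skip empty list elements
            name, sep, arg = text.partition("=")
            arg = arg.strip()
            if len(arg) >= 2 and arg[0] == '"' and arg[-1] == '"':
                arg = arg[1:-1]  # drop the surrounding quotes
            directives[name.strip().lower()] = arg if sep else None
        else:
            item.append(ch)
    return directives
```

A directive that maps to None appeared without an argument and so applies to the entire request or response; one that maps to a field-name list applies only to the named fields.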

The cache-control directives can be broken down into these general 
categories: 

o Restrictions on what is cachable; these may only be imposed by the
  origin server.
o Restrictions on what may be stored by a cache; these may be imposed
  by either the origin server or the user agent.
o Modifications of the basic expiration mechanism; these may be
  imposed by either the origin server or the user agent.
o Controls over cache revalidation and reload; these may only be
  imposed by a user agent.
o Control over transformation of entities.
o Extensions to the caching system.

14.9.1 What is Cachable

By default, a response is cachable if the requirements of the request 
method, request header fields, and the response status indicate that 
it is cachable. Section 13.4 summarizes these defaults for 
cachability. The following Cache-Control response directives allow an 
origin server to override the default cachability of a response: 

public

Indicates that the response is cachable by any cache, even if it 
would normally be non-cachable or cachable only within a non-shared 
cache. (See also Authorization, section 14.8, for additional 
details.) 

private 

Indicates that all or part of the response message is intended for a 
single user and MUST NOT be cached by a shared cache. This allows an 
origin server to state that the specified parts of the response are 
intended for only one user and are not a valid response for requests 
by other users. A private (non-shared) cache may cache the response. 

Note: This usage of the word private only controls where the 
response may be cached, and cannot ensure the privacy of the 
message content. 

no-cache 

Indicates that all or part of the response message MUST NOT be cached 
anywhere. This allows an origin server to prevent caching even by 
caches that have been configured to return stale responses to client 
requests. 

Note: Most HTTP/1.0 caches will not recognize or obey this 
directive. 

14.9.2 What May be Stored by Caches

The purpose of the no-store directive is to prevent the inadvertent 
release or retention of sensitive information (for example, on backup 
tapes). The no-store directive applies to the entire message, and may 
be sent either in a response or in a request. If sent in a request, a 
cache MUST NOT store any part of either this request or any response 
to it. If sent in a response, a cache MUST NOT store any part of 
either this response or the request that elicited it. This directive 
applies to both non-shared and shared caches. "MUST NOT store" in 
this context means that the cache MUST NOT intentionally store the 
information in non-volatile storage, and MUST make a best-effort 
attempt to remove the information from volatile storage as promptly 
as possible after forwarding it.

Even when this directive is associated with a response, users may
explicitly store such a response outside of the caching system (e.g., 
with a "Save As" dialog). History buffers may store such responses as 
part of their normal operation. 

The purpose of this directive is to meet the stated requirements of 
certain users and service authors who are concerned about accidental 
releases of information via unanticipated accesses to cache data 
structures. While the use of this directive may improve privacy in 
some cases, we caution that it is NOT in any way a reliable or 
sufficient mechanism for ensuring privacy. In particular, malicious 
or compromised caches may not recognize or obey this directive; and 
communications networks may be vulnerable to eavesdropping. 

14.9.3 Modifications of the Basic Expiration Mechanism 

The expiration time of an entity may be specified by the origin 
server using the Expires header (see section 14.21). Alternatively, 
it may be specified using the max-age directive in a response.

If a response includes both an Expires header and a max-age 
directive, the max-age directive overrides the Expires header, even 
if the Expires header is more restrictive. This rule allows an origin 
server to provide, for a given response, a longer expiration time to 
an HTTP/1.1 (or later) cache than to an HTTP/1.0 cache. This may be 
useful if certain HTTP/1.0 caches improperly calculate ages or 
expiration times, perhaps due to desynchronized clocks. 

Note: most older caches, not compliant with this specification, do
not implement any Cache-Control directives. An origin server
wishing to use a Cache-Control directive that restricts, but does
not prevent, caching by an HTTP/1.1-compliant cache may exploit the
requirement that the max-age directive overrides the Expires
header, and the fact that non-HTTP/1.1-compliant caches do not
observe the max-age directive.
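
The precedence rule can be sketched as follows. This is an illustration under stated assumptions: freshness_lifetime is a hypothetical helper, headers is assumed to be a plain dict keyed by canonical header names, the Expires-minus-Date calculation is taken from the expiration model rather than from this section, and the naive comma split ignores quoted arguments.

```python
from email.utils import parsedate_to_datetime


def freshness_lifetime(headers):
    """Return the freshness lifetime in seconds, or None if none is stated."""
    for item in headers.get("Cache-Control", "").split(","):
        name, _, arg = item.strip().partition("=")
        if name.strip().lower() == "max-age" and arg.strip().isdigit():
            # max-age overrides Expires, even if Expires is more restrictive.
            return int(arg.strip())
    if "Expires" in headers and "Date" in headers:
        try:
            expires = parsedate_to_datetime(headers["Expires"])
            date = parsedate_to_datetime(headers["Date"])
            return max(0, int((expires - date).total_seconds()))
        except (TypeError, ValueError):
            return 0  # unusable date value: treat the response as expired
    return None
```

An HTTP/1.0 cache that ignores Cache-Control sees only the Expires value, which is how a server can give the two generations of cache different lifetimes.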

Other directives allow a user agent to modify the basic expiration
mechanism. These directives may be specified on a request:

max-age 

Indicates that the client is willing to accept a response whose age
is no greater than the specified time in seconds. Unless the max-stale
directive is also included, the client is not willing to accept a
stale response.

min-fresh

Indicates that the client is willing to accept a response whose
freshness lifetime is no less than its current age plus the
specified time in seconds. That is, the client wants a response
that will still be fresh for at least the specified number of
seconds.

max-stale 

Indicates that the client is willing to accept a response that has 
exceeded its expiration time. If max-stale is assigned a value, 
then the client is willing to accept a response that has exceeded 
its expiration time by no more than the specified number of 
seconds. If no value is assigned to max-stale, then the client is 
willing to accept a stale response of any age. 

If a cache returns a stale response, either because of a max-stale 
directive on a request, or because the cache is configured to 
override the expiration time of a response, the cache MUST attach a 
Warning header to the stale response, using Warning 10 (Response is
stale).
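
The three request directives can be combined into one acceptance check. The sketch below is illustrative only: may_serve_from_cache and its parameters are hypothetical names, all times are in seconds, and the freshness test (a response is fresh while its lifetime exceeds its age) is assumed from the expiration model.

```python
def may_serve_from_cache(age, lifetime, max_age=None, min_fresh=None, max_stale=None):
    """Return (ok, stale) for a cached response under the request directives.

    max_stale is None when the directive is absent, True when it appears
    without a value, or a number of seconds when it carries a value.
    """
    if max_age is not None and age > max_age:
        return (False, False)  # older than the client accepts
    if min_fresh is not None and lifetime < age + min_fresh:
        return (False, False)  # will not stay fresh long enough
    if lifetime > age:
        return (True, False)   # still fresh
    staleness = age - lifetime
    if max_stale is True or (max_stale is not None and staleness <= max_stale):
        return (True, True)    # stale but acceptable: attach Warning 10
    return (False, False)
```

When the second element of the result is true, the cache is returning a stale response and must add the Warning header described above.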

14.9.4 Cache Revalidation and Reload Controls 

Sometimes a user agent may want or need to insist that a cache
revalidate its cache entry with the origin server (and not just with 
the next cache along the path to the origin server), or to reload its 
cache entry from the origin server. End-to-end revalidation may be 
necessary if either the cache or the origin server has overestimated 
the expiration time of the cached response. End-to-end reload may be 
necessary if the cache entry has become corrupted for some reason. 

End-to-end revalidation may be requested either when the client does
not have its own local cached copy, in which case we call it
"unspecified end-to-end revalidation", or when the client does have a
local cached copy, in which case we call it "specific end-to-end
revalidation."

The client can specify these three kinds of action using Cache- 
Control request directives: 

End-to-end reload 

The request includes a "no-cache" Cache-Control directive or, for 
compatibility with HTTP/1.0 clients, "Pragma: no-cache". No field 
names may be included with the no-cache directive in a request. The 
server MUST NOT use a cached copy when responding to such a 
request. 

Specific end-to-end revalidation 

The request includes a "max-age=0" Cache-Control directive, which
forces each cache along the path to the origin server to revalidate
its own entry, if any, with the next cache or server. The initial
request includes a cache-validating conditional with the client's
current validator.

Unspecified end-to-end revalidation

The request includes a "max-age=0" Cache-Control directive, which
forces each cache along the path to the origin server to revalidate 
its own entry, if any, with the next cache or server. The initial 
request does not include a cache-validating conditional; the first 
cache along the path (if any) that holds a cache entry for this 
resource includes a cache-validating conditional with its current 
validator. 

When an intermediate cache is forced, by means of a max-age=0 
directive, to revalidate its own cache entry, and the client has 
supplied its own validator in the request, the supplied validator may 
differ from the validator currently stored with the cache entry. In 
this case, the cache may use either validator in making its own 
request without affecting semantic transparency. 

However, the choice of validator may affect performance. The best 
approach is for the intermediate cache to use its own validator when 
making its request. If the server replies with 304 (Not Modified), 
then the cache should return its now validated copy to the client 
with a 200 (OK) response. If the server replies with a new entity and 
cache validator, however, the intermediate cache should compare the 
returned validator with the one provided in the client's request, 
using the strong comparison function. If the client's validator is 
equal to the origin server's, then the intermediate cache simply 
returns 304 (Not Modified). Otherwise, it returns the new entity with 
a 200 (OK) response. 
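
The recommended behavior can be expressed as a small decision function. This is a sketch under assumptions: the function names are hypothetical, validators are modeled as entity-tag strings in which a leading W/ marks a weak validator, and the strong comparison is taken to require identical values with neither one weak.

```python
def is_strong_match(a, b):
    """Strong comparison: identical validators, neither marked weak."""
    return a == b and not a.startswith("W/") and not b.startswith("W/")


def status_for_client(client_validator, origin_status, origin_validator=None):
    """Status an intermediate cache returns after revalidating its own entry."""
    if origin_status == 304:
        return 200  # the cache's copy is now validated; send it in full
    if origin_validator is not None and is_strong_match(client_validator, origin_validator):
        return 304  # the client already holds the new entity
    return 200      # forward the new entity to the client
```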

If a request includes the no-cache directive, it should not include 
min-fresh, max-stale, or max-age. 

In some cases, such as times of extremely poor network connectivity, 
a client may want a cache to return only those responses that it 
currently has stored, and not to reload or revalidate with the origin 
server. To do this, the client may include the only-if-cached
directive in a request. If it receives this directive, a cache SHOULD 
either respond using a cached entry that is consistent with the other 
constraints of the request, or respond with a 504 (Gateway Timeout) 
status. However, if a group of caches is being operated as a unified 
system with good internal connectivity, such a request MAY be 
forwarded within that group of caches. 

Because a cache may be configured to ignore a server's specified
expiration time, and because a client request may include a max-stale
directive (which has a similar effect), the protocol also includes a
mechanism for the origin server to require revalidation of a cache
entry on any subsequent use. When the must-revalidate directive is
present in a response received by a cache, that cache MUST NOT use
the entry after it becomes stale to respond to a subsequent request
without first revalidating it with the origin server. (I.e., the
cache must do an end-to-end revalidation every time, if, based solely
on the origin server's Expires or max-age value, the cached response
is stale.)

The must-revalidate directive is necessary to support reliable
operation for certain protocol features. In all circumstances an
HTTP/1.1 cache MUST obey the must-revalidate directive; in
particular, if the cache cannot reach the origin server for any
reason, it MUST generate a 504 (Gateway Timeout) response.

Servers should send the must-revalidate directive if and only if
failure to revalidate a request on the entity could result in
incorrect operation, such as a silently unexecuted financial
transaction. Recipients MUST NOT take any automated action that
violates this directive, and MUST NOT automatically provide an
unvalidated copy of the entity if revalidation fails.

Although this is not recommended, user agents operating under severe 
connectivity constraints may violate this directive but, if so, MUST 
explicitly warn the user that an unvalidated response has been
provided. The warning MUST be provided on each unvalidated access,
and SHOULD require explicit user confirmation. 

The proxy-revalidate directive has the same meaning as the
must-revalidate directive, except that it does not apply to non-shared
user agent caches. It can be used on a response to an authenticated 
request to permit the user's cache to store and later return the 
response without needing to revalidate it (since it has already been 
authenticated once by that user), while still requiring proxies that 
service many users to revalidate each time (in order to make sure 
that each user has been authenticated). Note that such authenticated 
responses also need the public cache control directive in order to 
allow them to be cached at all. 

14.9.5 No-Transform Directive 

Implementers of intermediate caches (proxies) have found it useful to 
convert the media type of certain entity bodies. A proxy might, for 
example, convert between image formats in order to save cache space 
or to reduce the amount of traffic on a slow link. HTTP has to date 
been silent on these transformations.

Serious operational problems have already occurred, however, when
these transformations have been applied to entity bodies intended for 
certain kinds of applications. For example, applications for medical 
imaging, scientific data analysis and those using end-to-end 
authentication, all depend on receiving an entity body that is bit 
for bit identical to the original entity-body. 

Therefore, if a response includes the no-transform directive, an 
intermediate cache or proxy MUST NOT change those headers that are 
listed in section 13.5.2 as being subject to the no-transform 
directive. This implies that the cache or proxy must not change any 
aspect of the entity-body that is specified by these headers. 

14.9.6 Cache Control Extensions 

The Cache-Control header field can be extended through the use of one 
or more cache-extension tokens, each with an optional assigned value. 
Informational extensions (those which do not require a change in 
cache behavior) may be added without changing the semantics of other 
directives. Behavioral extensions are designed to work by acting as 
modifiers to the existing base of cache directives. Both the new 
directive and the standard directive are supplied, such that 
applications which do not understand the new directive will default 
to the behavior specified by the standard directive, and those that 
understand the new directive will recognize it as modifying the 
requirements associated with the standard directive. In this way, 
extensions to the Cache-Control directives can be made without 
requiring changes to the base protocol. 

This extension mechanism depends on an HTTP cache obeying all of the
cache-control directives defined for its native HTTP-version, obeying 
certain extensions, and ignoring all directives that it does not 
understand. 

For example, consider a hypothetical new response directive called
"community" which acts as a modifier to the "private" directive. We
define this new directive to mean that, in addition to any non-shared
cache, any cache which is shared only by members of the community
named within its value may cache the response. An origin server
wishing to allow the "UCI" community to use an otherwise private
response in their shared cache(s) may do so by including

Cache-Control: private, community="UCI"

A cache seeing this header field will act correctly even if the cache 
does not understand the "community" cache-extension, since it will 
also see and understand the "private" directive and thus default to 
the safe behavior.

Unrecognized cache-directives MUST be ignored; it is assumed that any
cache-directive likely to be unrecognized by an HTTP/1.1 cache will
be combined with standard directives (or the response's default
cachability) such that the cache behavior will remain minimally
correct even if the cache does not understand the extension(s).

14.10 Connection

The Connection general-header field allows the sender to specify
options that are desired for that particular connection and MUST NOT 
be communicated by proxies over further connections. 

The Connection header has the following grammar: 

Connection-header = "Connection" ":" 1#(connection-token)
connection-token  = token

HTTP/1.1 proxies MUST parse the Connection header field before a
message is forwarded and, for each connection-token in this field,
remove any header field(s) from the message with the same name as the
connection-token. Connection options are signaled by the presence of
a connection-token in the Connection header field, not by any
corresponding additional header field(s), since the additional header
field may not be sent if there are no parameters associated with that
connection option. HTTP/1.1 defines the "close" connection option
for the sender to signal that the connection will be closed after
completion of the response. For example,

Connection: close 

in either the request or the response header fields indicates that 
the connection should not be considered 'persistent' (section 8.1)
after the current request/response is complete. 

HTTP/1.1 applications that do not support persistent connections MUST 
include the "close" connection option in every message. 
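
The proxy requirement for this header can be sketched in a few lines. The function name strip_connection_headers and the representation of a message's headers as a list of (name, value) pairs are assumptions made for illustration.

```python
def strip_connection_headers(headers):
    """Return a copy of `headers` without the fields named in Connection.

    The Connection field itself is dropped too, since its options apply
    only to the connection on which the message was received.
    """
    tokens = set()
    for name, value in headers:
        if name.lower() == "connection":
            tokens.update(t.strip().lower() for t in value.split(",") if t.strip())
    return [
        (name, value)
        for name, value in headers
        if name.lower() != "connection" and name.lower() not in tokens
    ]
```

Note that a token such as "close" usually has no matching header field; it is the token's presence in Connection that signals the option.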

14.11 Content-Base 

The Content-Base entity-header field may be used to specify the base 
URI for resolving relative URLs within the entity. This header field 
is described as Base in RFC 1808, which is expected to be revised.

Content-Base = "Content-Base" ":" absoluteURI

If no Content-Base field is present, the base URI of an entity is
defined either by its Content-Location (if that Content-Location URI
is an absolute URI) or the URI used to initiate the request, in that
order of precedence. Note, however, that the base URI of the contents
within the entity-body may be redefined within that entity-body.
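
The order of precedence can be written out as a short function. This is a sketch only: base_uri is a hypothetical name, headers is assumed to be a dict keyed by canonical header names, and a URI is treated as absolute when it carries a scheme.

```python
from urllib.parse import urlsplit


def base_uri(headers, request_uri):
    """Pick the base URI for resolving relative URLs in an entity."""
    content_base = headers.get("Content-Base")
    if content_base:
        return content_base
    location = headers.get("Content-Location")
    if location and urlsplit(location).scheme:
        return location  # only an absolute Content-Location qualifies
    return request_uri
```

A base redefined inside the entity-body itself is outside the scope of this sketch.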

14.12 Content-Encoding

The Content-Encoding entity-header field is used as a modifier to the
media-type. When present, its value indicates what additional content
codings have been applied to the entity-body, and thus what decoding
mechanisms MUST be applied in order to obtain the media-type
referenced by the Content-Type header field. Content-Encoding is
primarily used to allow a document to be compressed without losing
the identity of its underlying media type.

Content-Encoding = "Content-Encoding" ":" 1#content-coding

Content codings are defined in section 3.5. An example of its use is 

Content-Encoding: gzip 

The Content-Encoding is a characteristic of the entity identified by 
the Request-URI. Typically, the entity-body is stored with this 
encoding and is only decoded before rendering or analogous usage. 

If multiple encodings have been applied to an entity, the content 
codings MUST be listed in the order in which they were applied. 

Additional information about the encoding parameters MAY be provided 
by other entity-header fields not defined by this specification. 
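
Because codings are listed in the order they were applied, a recipient undoes them in reverse. The sketch below assumes that "deflate" denotes the zlib format and supports only the codings shown; decode_entity_body and the DECODERS table are hypothetical names.

```python
import gzip
import zlib

# Map each supported content-coding to a function that reverses it.
DECODERS = {
    "gzip": gzip.decompress,
    "x-gzip": gzip.decompress,
    "deflate": zlib.decompress,
}


def decode_entity_body(body, content_encoding):
    """Undo the codings named in a Content-Encoding value, last applied first."""
    codings = [c.strip().lower() for c in content_encoding.split(",") if c.strip()]
    for coding in reversed(codings):
        body = DECODERS[coding](body)  # raises KeyError for an unknown coding
    return body
```

For example, a body that was deflated and then gzipped carries "Content-Encoding: deflate, gzip" and is decoded by removing gzip first.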

14.13 Content-Language

The Content-Language entity-header field describes the natural
language(s) of the intended audience for the enclosed entity. Note
that this may not be equivalent to all the languages used within the
entity-body.

Content-Language = "Content-Language" ":" 1#language-tag

Language tags are defined in section 3.10. The primary purpose of
Content-Language is to allow a user to identify and differentiate
entities according to the user's own preferred language. Thus, if the
body content is intended only for a Danish-literate audience, the
appropriate field is

Content-Language: da 

If no Content-Language is specified, the default is that the content
is intended for all language audiences. This may mean that the sender
does not consider it to be specific to any natural language, or that
the sender does not know for which language it is intended.

Multiple languages MAY be listed for content that is intended for 
multiple audiences. For example, a rendition of the "Treaty of 
Waitangi," presented simultaneously in the original Maori and English 
versions, would call for 

Content-Language: mi, en 

However, just because multiple languages are present within an entity 
does not mean that it is intended for multiple linguistic audiences. 
An example would be a beginner's language primer, such as "A First
Lesson in Latin," which is clearly intended to be used by an
English-literate audience. In this case, the Content-Language should
only include "en".

Content-Language may be applied to any media type -- it is not
limited to textual documents.

14.14 Content-Length

The Content-Length entity-header field indicates the size of the 
message-body, in decimal number of octets, sent to the recipient or, 
in the case of the HEAD method, the size of the entity-body that 
would have been sent had the request been a GET. 

Content-Length = "Content-Length" ":" 1*DIGIT

An example is 

Content-Length: 3495 

Applications SHOULD use this field to indicate the size of the 
message-body to be transferred, regardless of the media type of the 
entity. It must be possible for the recipient to reliably determine
the end of HTTP/1.1 requests containing an entity-body, e.g., because
the request has a valid Content-Length field, uses Transfer-Encoding:
chunked or a multipart body.

Any Content-Length greater than or equal to zero is a valid value. 
Section 4.4 describes how to determine the length of a message-body 
if a Content-Length is not given.

Note: The meaning of this field is significantly different from the
corresponding definition in MIME, where it is an optional field
used within the "message/external-body" content-type. In HTTP, it
SHOULD be sent whenever the message's length can be determined 
prior to being transferred. 

14.15 Content-Location 

The Content-Location entity-header field may be used to supply the
resource location for the entity enclosed in the message. In the case
where a resource has multiple entities associated with it, and those
entities actually have separate locations by which they might be
individually accessed, the server should provide a Content-Location
for the particular variant which is returned. In addition, a server 
SHOULD provide a Content-Location for the resource corresponding to 
the response entity. 

Content-Location = "Content-Location" ":"
                   ( absoluteURI | relativeURI )

If no Content-Base header field is present, the value of Content- 
Location also defines the base URL for the entity (see section 
14.11). 

The Content-Location value is not a replacement for the original 
requested URI; it is only a statement of the location of the resource 
corresponding to this particular entity at the time of the request. 
Future requests MAY use the Content-Location URI if the desire is to 
identify the source of that particular entity. 

A cache cannot assume that an entity with a Content-Location 
different from the URI used to retrieve it can be used to respond to 
later requests on that Content-Location URI. However, the Content- 
Location can be used to differentiate between multiple entities 
retrieved from a single requested resource, as described in section 
13.6. 

If the Content-Location is a relative URI, the URI is interpreted 
relative to any Content-Base URI provided in the response. If no 
Content-Base is provided, the relative URI is interpreted relative to 
the Request-URI.

14.16 Content-MD5

The Content-MD5 entity-header field, as defined in RFC 1864 [23], is 
an MD5 digest of the entity-body for the purpose of providing an 
end-to-end message integrity check (MIC) of the entity-body. (Note: a 
MIC is good for detecting accidental modification of the entity-body 
in transit, but is not proof against malicious attacks.) 

Content-MD5 = "Content-MD5" ":" md5-digest

md5-digest = <base64 of 128 bit MD5 digest as per RFC 1864> 

The Content-MD5 header field may be generated by an origin server to 
function as an integrity check of the entity-body. Only origin 
servers may generate the Content-MD5 header field; proxies and 
gateways MUST NOT generate it, as this would defeat its value as an 
end-to-end integrity check. Any recipient of the entity-body, 
including gateways and proxies, MAY check that the digest value in 
this header field matches that of the entity-body as received. 

The MD5 digest is computed based on the content of the entity-body,
including any Content-Encoding that has been applied, but not
including any Transfer-Encoding that may have been applied to the 
message-body. If the message is received with a Transfer-Encoding, 
that encoding must be removed prior to checking the Content-MD5 value 
against the received entity. 

This has the result that the digest is computed on the octets of the 
entity-body exactly as, and in the order that, they would be sent if 
no Transfer-Encoding were being applied. 

HTTP extends RFC 1864 to permit the digest to be computed for MIME 
composite media-types (e.g., multipart/* and message/rfc822), but
this does not change how the digest is computed as defined in the 
preceding paragraph. 

Note: There are several consequences of this. The entity-body for
composite types may contain many body-parts, each with its own MIME
and HTTP headers (including Content-MD5, Content-Transfer-Encoding,
and Content-Encoding headers). If a body-part has a
Content-Transfer-Encoding or Content-Encoding header, it is assumed
that the content of the body-part has had the encoding applied, and
the body-part is included in the Content-MD5 digest as is -- i.e.,
after the application. The Transfer-Encoding header field is not
allowed within body-parts.

Note: while the definition of Content-MD5 is exactly the same for
HTTP as in RFC 1864 for MIME entity-bodies, there are several ways
in which the application of Content-MD5 to HTTP entity-bodies
differs from its application to MIME entity-bodies. One is that
HTTP, unlike MIME, does not use Content-Transfer-Encoding, and does
use Transfer-Encoding and Content-Encoding. Another is that HTTP
more frequently uses binary content types than MIME, so it is worth 
noting that, in such cases, the byte order used to compute the 
digest is the transmission byte order defined for the type. Lastly, 
HTTP allows transmission of text types with any of several line 
break conventions and not just the canonical form using CRLF. 
Conversion of all line breaks to CRLF should not be done before 
computing or checking the digest: the line break convention used in 
the text actually transmitted should be left unaltered when 
computing the digest. 
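
Computing and checking the digest is straightforward once the entity-body octets are in hand. A minimal sketch, assuming the body has already had any Transfer-Encoding removed and is passed exactly as received, with no line-break conversion; the function names are hypothetical.

```python
import base64
import hashlib


def content_md5(entity_body):
    """Return the base64 encoding of the 128-bit MD5 digest of the body."""
    return base64.b64encode(hashlib.md5(entity_body).digest()).decode("ascii")


def content_md5_matches(entity_body, header_value):
    """Check a received Content-MD5 value against the body as received."""
    return content_md5(entity_body) == header_value.strip()
```

A mismatch indicates accidental modification in transit; as noted earlier in this section, a match is no protection against a malicious change.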

14.17 Content-Range

The Content-Range entity-header is sent with a partial entity-body to 
specify where in the full entity-body the partial body should be 
inserted. It also indicates the total size of the full entity-body. 
When a server returns a partial response to a client, it must 
describe both the extent of the range covered by the response, and 
the length of the entire entity-body. 

Content-Range = "Content-Range" ":" content-range-spec 

content-range-spec      = byte-content-range-spec

byte-content-range-spec = bytes-unit SP first-byte-pos "-"
                          last-byte-pos "/" entity-length

entity-length           = 1*DIGIT

Unlike byte-ranges-specifier values, a byte-content-range-spec may 
only specify one range, and must contain absolute byte positions for 
both the first and last byte of the range. 

A byte-content-range-spec whose last-byte-pos value is less than its
first-byte-pos value, or whose entity-length value is less than or
equal to its last-byte-pos value, is invalid. The recipient of an
invalid byte-content-range-spec MUST ignore it and any content
transferred along with it.

Examples of byte-content-range-spec values, assuming that the entity
contains a total of 1234 bytes:

o The first 500 bytes:

  bytes 0-499/1234

o The second 500 bytes:

  bytes 500-999/1234

o All except for the first 500 bytes:

  bytes 500-1233/1234

o The last 500 bytes:

  bytes 734-1233/1234

When an HTTP message includes the content of a single range (for 
example, a response to a request for a single range, or to a request 
for a set of ranges that overlap without any holes), this content is 
transmitted with a Content-Range header, and a Content-Length header 
showing the number of bytes actually transferred. For example, 

HTTP/1.1 206 Partial content
Date: Wed, 15 Nov 1995 06:25:24 GMT
Last-modified: Wed, 15 Nov 1995 04:58:08 GMT
Content-Range: bytes 21010-47021/47022
Content-Length: 26012
Content-Type: image/gif
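
The validity rules and the relationship between Content-Range and Content-Length can be checked mechanically. The following sketch handles only the "bytes" unit; the function names are hypothetical.

```python
import re

_SPEC = re.compile(r"bytes ([0-9]+)-([0-9]+)/([0-9]+)")


def parse_byte_content_range(spec):
    """Return (first, last, entity_length), or None if the spec is invalid."""
    match = _SPEC.fullmatch(spec)
    if not match:
        return None
    first, last, length = (int(group) for group in match.groups())
    if last < first or length <= last:
        return None  # invalid per the rules for byte-content-range-spec
    return first, last, length


def range_content_length(spec):
    """Number of bytes actually transferred for a valid spec, else None."""
    parsed = parse_byte_content_range(spec)
    if parsed is None:
        return None
    first, last, _ = parsed
    return last - first + 1  # byte positions are inclusive
```

For the response shown above, bytes 21010-47021 cover 47021 - 21010 + 1 = 26012 octets, which agrees with its Content-Length.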

When an HTTP message includes the content of multiple ranges (for 
example, a response to a request for multiple non-overlapping 
ranges), these are transmitted as a multipart MIME message. The 
multipart MIME content-type used for this purpose is defined in this 
specification to be "multipart/byteranges". See appendix 19.2 for its
definition. 

A client that cannot decode a MIME multipart/byteranges message 
should not ask for multiple byte-ranges in a single request. 

When a client requests multiple byte-ranges in one request, the 
server SHOULD return them in the order that they appeared in the 
request. 

If the server ignores a byte-range-spec because it is invalid, the
server should treat the request as if the invalid Range header field
did not exist. (Normally, this means return a 200 response containing
the full entity). The reason is that the only time a client will make
such an invalid request is when the entity is smaller than the entity
retrieved by a prior request.

14.18 Content-Type

The Content-Type entity-header field indicates the media type of the
entity-body sent to the recipient or, in the case of the HEAD method,
the media type that would have been sent had the request been a GET.

Content-Type = "Content-Type" ":" media-type

Media types are defined in section 3.7. An example of the field is

Content-Type: text/html; charset=ISO-8859-4

Further discussion of methods for identifying the media type of an 
entity is provided in section 7.2.1. 

14.19 Date

The Date general-header field represents the date and time at which
the message was originated, having the same semantics as orig-date in 
RFC 822. The field value is an HTTP-date, as described in section 
3.3.1. 

Date = "Date" ":" HTTP-date 

An example is 

Date: Tue, 15 Nov 1994 08:12:31 GMT 

If a message is received via direct connection with the user agent 
(in the case of requests) or the origin server (in the case of 
responses), then^ the date can be assumed to be the current date at 
the receiving end. However, since the date — as it is believed by the 
origin — is important for evaluating cached responses, origin servers 
MUST include a Date header field in all responses. Clients SHOULD 
only send a Date header field in messages that include an entity- 
body, as in the case of the PUT and POST requests, and even then it 
is optional. A received message which does not have a Date header 
field SHOULD be assigned one by the recipient if the message will be 
cached by that recipient or gatewayed via a protocol which requires a 
Date.

In theory, the date SHOULD represent the moment just before the
entity is generated. In practice, the date can be generated at any 
time during the message origination without affecting its semantic 
value. 

The format of the Date is an absolute date and time as defined by 
HTTP-date in section 3.3; it MUST be sent in RFC1123 [8]-date format.
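
As a rough illustration of the RFC 1123 date format required above, the
following Python sketch formats a Date field value with the standard
library. The function name http_date is an assumption made for this
example; it is not defined by the specification.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def http_date(moment: datetime) -> str:
    """Format a timezone-aware datetime as an RFC 1123 date in GMT."""
    # usegmt=True emits the literal "GMT" zone rather than "+0000";
    # it requires a datetime whose offset from UTC is zero.
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


if __name__ == "__main__":
    example = datetime(1994, 11, 15, 8, 12, 31, tzinfo=timezone.utc)
    print("Date: " + http_date(example))
```

The sample input reproduces the Date example given earlier in this
section.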

14.20 ETag 

The ETag entity-header field defines the entity tag for the 
associated entity. The headers used with entity tags are described in 
sections 14.20, 14.25, 14.26 and 14.43. The entity tag may be used 
for comparison with other entities from the same resource (see 
section 13.3.2). 

ETag = "ETag" ":" entity-tag 

Examples: 

ETag: "xyzzy" 
ETag: W/"xyzzy"
ETag: ""

14.21 Expires

The Expires entity-header field gives the date/time after which the 
response should be considered stale. A stale cache entry may not 
normally be returned by a cache (either a proxy cache or a user
agent cache) unless it is first validated with the origin server (or 
with an intermediate cache that has a fresh copy of the entity). See 
section 13.2 for further discussion of the expiration model. 

The presence of an Expires field does not imply that the original 
resource will change or cease to exist at, before, or after that 
time. 

The format is an absolute date and time as defined by HTTP-date in 
section 3.3; it MUST be in RFC1123-date format: 

Expires = "Expires" ":" HTTP-date

An example of its use is

Expires: Thu, 01 Dec 1994 16:00:00 GMT 

Note: if a response includes a Cache-Control field with the max-age 
directive, that directive overrides the Expires field. 

HTTP/1.1 clients and caches MUST treat other invalid date formats, 
especially including the value "0", as in the past (i.e., "already 
expired"). 

To mark a response as "already expired," an origin server should use 
an Expires date that is equal to the Date header value. (See the 
rules for expiration calculations in section 13.2.4.) 

To mark a response as "never expires," an origin server should use an 
Expires date approximately one year from the time the response is 
sent. HTTP/1.1 servers should not send Expires dates more than one 
year in the future. 

The presence of an Expires header field with a date value of some 
time in the future on a response that otherwise would by default be
non-cacheable indicates that the response is cacheable, unless
indicated otherwise by a Cache-Control header field (section 14.9). 
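
The rule that invalid Expires values (such as "0") count as already
expired can be sketched as follows. This is a minimal illustration, not
a complete cache: the function name expires_is_stale is hypothetical,
the max-age override is not modelled, and treating a response as stale
when the current time equals the Expires time is an assumption drawn
from the "already expired" convention above.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def expires_is_stale(expires_value: str, now: datetime) -> bool:
    """Return True if a response with this Expires value is stale at `now`."""
    try:
        expires = parsedate_to_datetime(expires_value)
    except (TypeError, ValueError):
        # Unparseable values, including "0", are treated as in the past.
        return True
    if expires.tzinfo is None:
        # Assumption: a date without zone information is read as GMT.
        expires = expires.replace(tzinfo=timezone.utc)
    return now >= expires
```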

14.22 From 

The From request-header field, if given, SHOULD contain an Internet
e-mail address for the human user who controls the requesting user
agent. The address SHOULD be machine-usable, as defined by mailbox
in RFC 822 (as updated by RFC 1123):

From = "From" ":" mailbox 

An example is: 

From: webmaster@w3.org 

This header field MAY be used for logging purposes and as a means for 
identifying the source of invalid or unwanted requests. It SHOULD NOT 
be used as an insecure form of access protection. The interpretation 
of this field is that the request is being performed on behalf of the 
person given, who accepts responsibility for the method performed. In
particular, robot agents SHOULD include this header so that the 
person responsible for running the robot can be contacted if problems 
occur on the receiving end.

The Internet e-mail address in this field MAY be separate from the
Internet host which issued the request. For example, when a request 
is passed through a proxy the original issuer's address SHOULD be 
used. 

Note: The client SHOULD not send the From header field without the 
user's approval, as it may conflict with the user's privacy 
interests or their site's security policy. It is strongly 
recommended that the user be able to disable, enable, and modify 
the value of this field at any time prior to a request. 

14.23 Host 

The Host request-header field specifies the Internet host and port
number of the resource being requested, as obtained from the original 
URL given by the user or referring resource (generally an HTTP URL, 
as described in section 3.2.2). The Host field value MUST represent 
the network location of the origin server or gateway given by the 
original URL. This allows the origin server or gateway to 
differentiate between internally-ambiguous URLs, such as the root "/"
URL of a server for multiple host names on a single IP address. 

Host = "Host" ":" host [ ":" port ] ; Section 3.2.2 

A "host" without any trailing port information implies the default 
port for the service requested (e.g., "80" for an HTTP URL). For 
example, a request on the origin server for 
<http://www.w3.org/pub/WWW/> MUST include: 

GET /pub/WWW/ HTTP/1.1
Host: www.w3.org 

A client MUST include a Host header field in all HTTP/1.1 request 
messages on the Internet (i.e., on any message corresponding to a 
request for a URL which includes an Internet host address for the 
service being requested). If the Host field is not already present, 
an HTTP/1.1 proxy MUST add a Host field to the request message prior 
to forwarding it on the Internet. All Internet-based HTTP/1.1 servers 
MUST respond with a 400 status code to any HTTP/1.1 request message 
which lacks a Host header field. 

See sections 5.2 and 19.5.1 for other requirements relating to Host. 
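
The server-side requirement above (a 400 response to any HTTP/1.1
request that lacks a Host field) might be checked roughly as shown
below. The function name and the plain-dictionary representation of
headers are assumptions made for this sketch.

```python
from typing import Mapping, Optional


def host_check_status(http_version: str, headers: Mapping[str, str]) -> Optional[int]:
    """Return 400 when an HTTP/1.1 request has no Host field, else None."""
    # Header field names are case-insensitive, so compare in lower case.
    names = {name.lower() for name in headers}
    if http_version == "HTTP/1.1" and "host" not in names:
        return 400
    return None
```

A return value of None means the request may be processed further.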

14.24 If-Modified-Since

The If-Modified-Since request-header field is used with the GET
method to make it conditional: if the requested variant has not been
modified since the time specified in this field, an entity will not
be returned from the server; instead, a 304 (not modified) response
will be returned without any message-body. 

If-Modified-Since = "If-Modified-Since" ":" HTTP-date

An example of the field is:

If-Modified-Since: Sat, 29 Oct 1994 19:43:31 GMT

A GET method with an If-Modified-Since header and no Range header
requests that the identified entity be transferred only if it has
been modified since the date given by the If-Modified-Since header.
The algorithm for determining this includes the following cases:

a) If the request would normally result in anything other than a 200
(OK) status, or if the passed If-Modified-Since date is invalid, the
response is exactly the same as for a normal GET. A date which is
later than the server's current time is invalid.

b) If the variant has been modified since the If-Modified-Since date,
the response is exactly the same as for a normal GET.

c) If the variant has not been modified since a valid If-Modified-Since
date, the server MUST return a 304 (Not Modified) response.

The purpose of this feature is to allow efficient updates of cached 
information with a minimum amount of transaction overhead. 
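
Cases a) through c) can be sketched as a small decision function. This
is an illustrative reading of the algorithm, not a normative
implementation; the function name and its parameters (the status a
normal GET would produce, the variant's last-modified time, and the
server's current time) are assumptions introduced for the example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def if_modified_since_status(normal_status: int, header_value: str,
                             last_modified: datetime, now: datetime) -> int:
    """Return the status for a GET carrying If-Modified-Since (no Range)."""
    # Case a: anything other than 200 is returned unchanged.
    if normal_status != 200:
        return normal_status
    try:
        since = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        return normal_status  # case a: unparseable date is invalid
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)  # assumption: read as GMT
    if since > now:
        return normal_status  # case a: a date in the server's future is invalid
    if last_modified > since:
        return normal_status  # case b: modified since the given date
    return 304                # case c: not modified
```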

Note that the Range request-header field modifies the meaning of
If-Modified-Since; see section 14.36 for full details.

Note that If-Modified-Since times are interpreted by the server,
whose clock may not be synchronized with the client.

Note that if a client uses an arbitrary date in the If-Modified-Since
header instead of a date taken from the Last-Modified header for the 
same request, the client should be aware of the fact that this date 
is interpreted in the server's understanding of time. The client 
should consider unsynchronized clocks and rounding problems due to 
the different encodings of time between the client and server. This 
includes the possibility of race conditions if the document has 
changed between the time it was first requested and the If-Modified- 
Since date of a subsequent request, and the possibility of clock- 
skew-related problems if the If-Modified-Since date is derived from
the client's clock without correction to the server's clock. 
Corrections for different time bases between client and server are at 
best approximate due to network latency.

14.25 If-Match

The If-Match request-header field is used with a method to make it
conditional. A client that has one or more entities previously 
obtained from the resource can verify that one of those entities is 
current by including a list of their associated entity tags in the 
If-Match header field. The purpose of this feature is to allow 
efficient updates of cached information with a minimum amount of 
transaction overhead. It is also used, on updating requests, to 
prevent inadvertent modification of the wrong version of a resource. 
As a special case, the value "*" matches any current entity of the 
resource. 

If-Match = "If-Match" ":" ( "*" | 1#entity-tag )

If any of the entity tags match the entity tag of the entity that
would have been returned in the response to a similar GET request
(without the If-Match header) on that resource, or if "*" is given
and any current entity exists for that resource, then the server MAY 
perform the requested method as if the If-Match header field did not 
exist. 

A server MUST use the strong comparison function (see section 3.11) 
to compare the entity tags in If-Match. 

If none of the entity tags match, or if "*" is given and no current 
entity exists, the server MUST NOT perform the requested method, and 
MUST return a 412 (Precondition Failed) response. This behavior is 
most useful when the client wants to prevent an updating method, such 
as PUT, from modifying a resource that has changed since the client 
last retrieved it. 

If the request would, without the If-Match header field, result in 
anything other than a 2xx status, then the If-Match header MUST be 
ignored. 

The meaning of "If-Match: *" is that the method SHOULD be performed 
if the representation selected by the origin server (or by a cache, 
possibly using the Vary mechanism, see section 14.43) exists, and 
MUST NOT be performed if the representation does not exist.

A request intended to update a resource (e.g., a PUT) MAY include an
If-Match header field to signal that the request method MUST NOT be 
applied if the entity corresponding to the If-Match value (a single 
entity tag) is no longer a representation of that resource. This 
allows the user to indicate that they do not wish the request to be 
successful if the resource has been changed without their knowledge.

Examples:

If-Match: "xyzzy"
If-Match: "xyzzy", "r2d2xxxx", "c3piozzzz"
If-Match: *
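
The If-Match rules above can be sketched as follows, assuming the
header has already been split into a list of entity tags (or the single
item "*"). The function names are hypothetical, and the rule that the
header is ignored when the request would not otherwise yield a 2xx
status is left out for brevity.

```python
from typing import Optional, Sequence


def strong_match(a: str, b: str) -> bool:
    """Strong comparison: tags are identical and neither is weak."""
    return not a.startswith("W/") and not b.startswith("W/") and a == b


def if_match_status(condition: Sequence[str], current: Optional[str]) -> Optional[int]:
    """Return 412 if the precondition fails, or None to proceed.

    `current` is the entity tag of the current entity, or None when
    no current entity exists for the resource.
    """
    if list(condition) == ["*"]:
        return None if current is not None else 412
    if current is not None and any(strong_match(tag, current) for tag in condition):
        return None
    return 412
```

For example, a PUT carrying If-Match: "xyzzy" proceeds only while the
resource's current entity tag is still "xyzzy".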

14.26 If-None-Match 

The If-None-Match request-header field is used with a method to make
it conditional. A client that has one or more entities previously 
obtained from the resource can verify that none of those entities is 
current by including a list of their associated entity tags in the 
If-None-Match header field. The purpose of this feature is to allow 
efficient updates of cached information with a minimum amount of 
transaction overhead. It is also used, on updating requests, to 
prevent inadvertent modification of a resource which was not known to 
exist. 

As a special case, the value "*" matches any current entity of the 
resource. 

If-None-Match = "If-None-Match" ":" ( "*" | 1#entity-tag )

If any of the entity tags match the entity tag of the entity that 
would have been returned in the response to a similar GET request 
(without the If-None-Match header) on that resource, or if "*" is 
given and any current entity exists for that resource, then the 
server MUST NOT perform the requested method. Instead, if the request 
method was GET or HEAD, the server SHOULD respond with a 304 (Not 
Modified) response, including the cache-related entity-header fields 
(particularly ETag) of one of the entities that matched. For all 
other request methods, the server MUST respond with a status of 412 
(Precondition Failed).

See section 13.3.3 for rules on how to determine if two entity tags 
match. The weak comparison function can only be used with GET or HEAD 
requests. 

If none of the entity tags match, or if "*" is given and no current 
entity exists, then the server MAY perform the requested method as if 
the If-None-Match header field did not exist.

If the request would, without the If-None-Match header field, result
in anything other than a 2xx status, then the If-None-Match header 
MUST be ignored. 

The meaning of "If-None-Match: *" is that the method MUST NOT be 
performed if the representation selected by the origin server (or by 
a cache, possibly using the Vary mechanism, see section 14.43) 
exists, and SHOULD be performed if the representation does not exist. 
This feature may be useful in preventing races between PUT 
operations. 

Examples:

If-None-Match: "xyzzy"

If-None-Match: W/"xyzzy"

If-None-Match: "xyzzy", "r2d2xxxx", "c3piozzzz"

If-None-Match: W/"xyzzy", W/"r2d2xxxx", W/"c3piozzzz"

If-None-Match: * 
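
A comparable sketch for If-None-Match is given below. It assumes the
header is already split into entity tags, and it chooses the weak
comparison for GET and HEAD and the strong comparison otherwise; that
choice is one reading of the text above (which only says when the weak
function may be used), and the function names are hypothetical.

```python
from typing import Optional, Sequence


def opaque(tag: str) -> str:
    """Strip a leading weakness indicator from an entity tag."""
    return tag[2:] if tag.startswith("W/") else tag


def tags_match(a: str, b: str, weak_allowed: bool) -> bool:
    if weak_allowed:
        return opaque(a) == opaque(b)  # weak comparison
    return not a.startswith("W/") and not b.startswith("W/") and a == b


def if_none_match_status(method: str, condition: Sequence[str],
                         current: Optional[str]) -> Optional[int]:
    """Return 304 or 412 if the method must not be performed, else None."""
    if current is None:
        return None  # no current entity: nothing can match
    retrieval = method in ("GET", "HEAD")
    if list(condition) == ["*"]:
        matched = True
    else:
        matched = any(tags_match(tag, current, retrieval) for tag in condition)
    if not matched:
        return None
    return 304 if retrieval else 412
```

A PUT sent with If-None-Match: * therefore fails with 412 when the
resource already exists, which is how the header helps prevent races
between PUT operations.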

14.27 If-Range

If a client has a partial copy of an entity in its cache, and wishes
to have an up-to-date copy of the entire entity in its cache, it
could use the Range request-header with a conditional GET (using
either or both of If-Unmodified-Since and If-Match.) However, if the
condition fails because the entity has been modified, the client 
would then have to make a second request to obtain the entire current 
entity-body. 

The If-Range header allows a client to "short-circuit" the second
request. Informally, its meaning is "if the entity is unchanged, send
me the part(s) that I am missing; otherwise, send me the entire new
entity."

If-Range = "If-Range" ":" ( entity-tag | HTTP-date )

If the client has no entity tag for an entity, but does have a Last-
Modified date, it may use that date in an If-Range header. (The server
can distinguish between a valid HTTP-date and any form of entity-tag 
by examining no more than two characters.) The If-Range header should 
only be used together with a Range header, and must be ignored if the 
request does not include a Range header, or if the server does not 
support the sub-range operation.

If the entity tag given in the If-Range header matches the current
entity tag for the entity, then the server should provide the
specified sub-range of the entity using a 206 (Partial Content)
response. If the entity tag does not match, then the server should 
return the entire entity using a 200 (OK) response. 
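
The If-Range decision might be sketched as below for a request that
also carries a Range header. The two-character test follows from the
grammar: an entity tag begins with a double quote or with W/, which no
HTTP-date does. The function name is hypothetical, tags are compared by
simple equality, and treating a date as "unchanged" only when it equals
the entity's Last-Modified time is an assumption, since the text above
spells out only the entity-tag case.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def if_range_status(if_range: str, current_etag: Optional[str],
                    last_modified: Optional[datetime]) -> int:
    """Return 206 to send the requested sub-range, or 200 for the full entity."""
    value = if_range.strip()
    if value.startswith('"') or value.startswith("W/"):
        unchanged = current_etag is not None and value == current_etag
    else:
        try:
            date = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            return 200  # assumption: an unparseable validator never matches
        if date.tzinfo is None:
            date = date.replace(tzinfo=timezone.utc)
        unchanged = last_modified is not None and date == last_modified
    return 206 if unchanged else 200
```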

14.28 If-Unmodified-Since

The If-Unmodified-Since request-header field is used with a method to
make it conditional. If the requested resource has not been modified
since the time specified in this field, the server should perform the
requested operation as if the If-Unmodified-Since header were not
present. 

If the requested variant has been modified since the specified time, 
the server MUST NOT perform the requested operation, and MUST return 
a 412 (Precondition Failed). 

If-Unmodified-Since = "If-Unmodified-Since" ":" HTTP-date

An example of the field is:

If-Unmodified-Since: Sat, 29 Oct 1994 19:43:31 GMT

If the request normally (i.e., without the If-Unmodified-Since
header) would result in anything other than a 2xx status, the If-
Unmodified-Since header should be ignored.

If the specified date is invalid, the header is ignored. 

14.29 Last-Modified 

The Last-Modified entity-header field indicates the date and time at 
which the origin server believes the variant was last modified. 

Last-Modified = "Last-Modified" ":" HTTP-date

An example of its use is 

Last-Modified: Tue, 15 Nov 1994 12:45:26 GMT 

The exact meaning of this header field depends on the implementation 
of the origin server and the nature of the original resource. For 
files, it may be just the file system last-modified time. For 
entities with dynamically included parts, it may be the most recent 
of the set of last-modify times for its component parts. For database 
gateways, it may be the last-update time stamp of the record. For 
virtual objects, it may be the last time the internal state changed.

An origin server MUST NOT send a Last-Modified date which is later
than the server's time of message origination. In such cases, where 
the resource's last modification would indicate some time in the 
future, the server MUST replace that date with the message 
origination date. 

An origin server should obtain the Last-Modified value of the entity 
as close as possible to the time that it generates the Date value of 
its response. This allows a recipient to make an accurate assessment 
of the entity's modification time, especially if the entity changes 
near the time that the response is generated. 

HTTP/1.1 servers SHOULD send Last-Modified whenever feasible. 

14.30 Location

The Location response-header field is used to redirect the recipient 
to a location other than the Request-URI for completion of the 
request or identification of a new resource. For 201 (Created) 
responses, the Location is that of the new resource which was created 
by the request. For 3xx responses, the location SHOULD indicate the 
server's preferred URL for automatic redirection to the resource. The 
field value consists of a single absolute URL.

Location = "Location" ":" absoluteURI

An example is

Location: http://www.w3.org/pub/WWW/People.html

Note: The Content-Location header field (section 14.15) differs
from Location in that the Content-Location identifies the original
location of the entity enclosed in the request. It is therefore
possible for a response to contain header fields for both Location
and Content-Location. Also see section 13.10 for cache requirements
of some methods.

14.31 Max-Forwards

The Max-Forwards request-header field may be used with the TRACE
method (section 9.8) to limit the number of proxies or gateways
that can forward the request to the next inbound server. This can be
useful when the client is attempting to trace a request chain which
appears to be failing or looping in mid-chain.

Max-Forwards = "Max-Forwards" ":" 1*DIGIT

The Max-Forwards value is a decimal integer indicating the remaining
number of times this request message may be forwarded. 

Each proxy or gateway recipient of a TRACE request containing a Max- 
Forwards header field SHOULD check and update its value prior to 
forwarding the request. If the received value is zero (0), the 
recipient SHOULD NOT forward the request; instead, it SHOULD respond 
as the final recipient with a 200 (OK) response containing the 
received request message as the response entity-body (as described in 
section 9.8). If the received Max-Forwards value is greater than 
zero, then the forwarded message SHOULD contain an updated Max- 
Forwards field with a value decremented by one (1). 

The Max-Forwards header field SHOULD be ignored for all other methods 
defined by this specification and for any extension methods for which 
it is not explicitly referred to as part of that method definition. 
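
The per-hop handling of Max-Forwards on a TRACE request could be
sketched like this. The function name, the dictionary of headers, and
the "respond"/"forward" labels are assumptions for illustration; the
sketch also assumes the field value has already been validated as
1*DIGIT.

```python
from typing import Dict, Tuple


def handle_trace_hop(headers: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Decide what a proxy or gateway does with a TRACE request.

    Returns ("respond", headers) when this hop should answer as the
    final recipient, or ("forward", updated_headers) otherwise.
    """
    value = headers.get("Max-Forwards")
    if value is None:
        return ("forward", dict(headers))  # no limit was given
    remaining = int(value)
    if remaining == 0:
        return ("respond", dict(headers))  # answer with 200 and echo the request
    updated = dict(headers)
    updated["Max-Forwards"] = str(remaining - 1)  # decrement by one
    return ("forward", updated)
```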

14.32 Pragma 

The Pragma general-header field is used to include implementation-
specific directives that may apply to any recipient along the
request/response chain. All pragma directives specify optional
behavior from the viewpoint of the protocol; however, some systems
MAY require that behavior be consistent with the directives.

Pragma = "Pragma" ":" 1#pragma-directive

pragma-directive = "no-cache" | extension-pragma
extension-pragma = token [ "=" ( token | quoted-string ) ]

When the no-cache directive is present in a request message, an 
application SHOULD forward the request toward the origin server even 
if it has a cached copy of what is being requested. This pragma 
directive has the same semantics as the no-cache cache-directive (see 
section 14.9) and is defined here for backwards compatibility with 
HTTP/1.0. Clients SHOULD include both header fields when a no-cache 
request is sent to a server not known to be HTTP/1.1 compliant. 

Pragma directives MUST be passed through by a proxy or gateway 
application, regardless of their significance to that application, 
since the directives may be applicable to all recipients along the 
request/response chain. It is not possible to specify a pragma for a 
specific recipient; however, any pragma directive not relevant to a 
recipient SHOULD be ignored by that recipient.

HTTP/1.1 clients SHOULD NOT send the Pragma request-header. HTTP/1.1
caches SHOULD treat "Pragma: no-cache" as if the client had sent 
"Cache-Control: no-cache". No new Pragma directives will be defined 
in HTTP. 
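
The equivalence of "Pragma: no-cache" and "Cache-Control: no-cache" for
a cache can be sketched as a single check. The function name and the
exact-case dictionary lookup are simplifying assumptions; a real parser
would also need to handle quoted strings containing commas.

```python
from typing import Mapping, Set


def _directive_names(field_value: str) -> Set[str]:
    """Return the lower-cased directive names of a comma-separated field."""
    return {part.split("=", 1)[0].strip().lower()
            for part in field_value.split(",") if part.strip()}


def request_bypasses_cache(headers: Mapping[str, str]) -> bool:
    """True when the request asks to be forwarded toward the origin server."""
    pragma = _directive_names(headers.get("Pragma", ""))
    cache_control = _directive_names(headers.get("Cache-Control", ""))
    return "no-cache" in pragma or "no-cache" in cache_control
```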

14.33 Proxy-Authenticate 

The Proxy-Authenticate response-header field MUST be included as part 
of a 407 (Proxy Authentication Required) response. The field value 
consists of a challenge that indicates the authentication scheme and 
parameters applicable to the proxy for this Request-URI. 

Proxy-Authenticate = "Proxy-Authenticate" ":" challenge 

The HTTP access authentication process is described in section 11. 
Unlike WWW-Authenticate, the Proxy-Authenticate header field applies
only to the current connection and SHOULD NOT be passed on to 
downstream clients. However, an intermediate proxy may need to obtain 
its own credentials by requesting them from the downstream client, 
which in some circumstances will appear as if the proxy is forwarding 
the Proxy-Authenticate header field.

14.34 Proxy-Authorization 

The Proxy-Authorization request-header field allows the client to 
identify itself (or its user) to a proxy which requires 
authentication. The Proxy-Authorization field value consists of 
credentials containing the authentication information of the user 
agent for the proxy and/or realm of the resource being requested. 

Proxy-Authorization = "Proxy-Authorization" ":" credentials

The HTTP access authentication process is described in section 11. 
Unlike Authorization, the Proxy-Authorization header field applies 
only to the next outbound proxy that demanded authentication using 
the Proxy-Authenticate field. When multiple proxies are used in a 
chain, the Proxy-Authorization header field is consumed by the first 
outbound proxy that was expecting to receive credentials. A proxy MAY 
relay the credentials from the client request to the next proxy if 
that is the mechanism by which the proxies cooperatively authenticate 
a given request. 

14.35 Public 

The Public response-header field lists the set of methods supported 
by the server. The purpose of this field is strictly to inform the 
recipient of the capabilities of the server regarding unusual 
methods. The methods listed may or may not be applicable to the
Request-URI; the Allow header field (section 14.7) MAY be used to
indicate methods allowed for a particular URI.

Public = "Public" ":" 1#method

Example of use: 

Public: OPTIONS, MGET, MHEAD, GET, HEAD 

This header field applies only to the server directly connected to 
the client (i.e., the nearest neighbor in a chain of connections). If 
the response passes through a proxy, the proxy MUST either remove the 
Public header field or replace it with one applicable to its own 
capabilities. 

14.36 Range 

14.36.1 Byte Ranges

Since all HTTP entities are represented in HTTP messages as sequences 
of bytes, the concept of a byte range is meaningful for any HTTP 
entity. (However, not all clients and servers need to support byte- 
range operations.) 

Byte range specifications in HTTP apply to the sequence of bytes in 
the entity-body (not necessarily the same as the message-body). 

A byte range operation may specify a single range of bytes, or a set 
of ranges within a single entity. 

ranges-specifier = byte-ranges-specifier 

byte-ranges-specifier = bytes-unit "=" byte-range-set

byte-range-set = 1#( byte-range-spec | suffix-byte-range-spec )

byte-range-spec = first-byte-pos "-" [last-byte-pos]

first-byte-pos = 1*DIGIT

last-byte-pos = 1*DIGIT

The first-byte-pos value in a byte-range-spec gives the byte-offset
of the first byte in a range. The last-byte-pos value gives the 
byte-offset of the last byte in the range; that is, the byte 
positions specified are inclusive. Byte offsets start at zero.

If the last-byte-pos value is present, it must be greater than or
equal to the first-byte-pos in that byte-range-spec, or the byte-
range-spec is invalid. The recipient of an invalid byte-range-spec 
must ignore it. 

If the last-byte-pos value is absent, or if the value is greater than 
or equal to the current length of the entity-body, last-byte-pos is 
taken to be equal to one less than the current length of the entity- 
body in bytes. 

By its choice of last-byte-pos, a client can limit the number of 
bytes retrieved without knowing the size of the entity. 

suffix-byte-range-spec = "-" suffix-length 

suffix-length = 1*DIGIT 

A suffix-byte-range-spec is used to specify the suffix of the 
entity-body, of a length given by the suffix-length value. (That is, 
this form specifies the last N bytes of an entity-body.) If the 
entity is shorter than the specified suffix-length, the entire
entity-body is used. 

Examples of byte-ranges-specifier values (assuming an entity-body of 
length 10000): 

o The first 500 bytes (byte offsets 0-499, inclusive): 
bytes=0-499 

o The second 500 bytes (byte offsets 500-999, inclusive): 
bytes=500-999 

o The final 500 bytes (byte offsets 9500-9999, inclusive): 
bytes=-500 

o Or 

bytes=9500- 

o The first and last bytes only (bytes 0 and 9999): 
bytes=0-0,-1

o Several legal but not canonical specifications of the second
500 bytes (byte offsets 500-999, inclusive):

bytes=500-600,601-999

bytes=500-700,601-999
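
The byte-range rules of this section can be sketched as a resolver that
turns a byte-ranges-specifier into inclusive (first, last) offsets for
an entity of known length. The function name is hypothetical, and
silently skipping a range that starts beyond the end of the entity is
an assumption, since this section does not say how to treat it.

```python
import re
from typing import List, Tuple

# One byte-range-spec or suffix-byte-range-spec: digits, "-", digits.
_SPEC = re.compile(r"(\d*)-(\d*)", re.ASCII)


def resolve_byte_ranges(specifier: str, length: int) -> List[Tuple[int, int]]:
    """Resolve e.g. "bytes=0-499" against an entity-body of `length` bytes."""
    unit, _, range_set = specifier.partition("=")
    if unit.strip() != "bytes":
        return []
    resolved = []
    for item in range_set.split(","):
        match = _SPEC.fullmatch(item.strip())
        if match is None or match.groups() == ("", ""):
            continue  # not a well-formed spec: ignore it
        first, last = match.groups()
        if first == "":
            # Suffix form: the last N bytes, or the whole entity if shorter.
            count = min(int(last), length)
            if count > 0:
                resolved.append((length - count, length - 1))
            continue
        start = int(first)
        if last != "" and int(last) < start:
            continue  # invalid byte-range-spec: must be ignored
        # An absent or too-large last-byte-pos means "to the end".
        end = length - 1 if last == "" else min(int(last), length - 1)
        if start > end:
            continue  # assumption: a range starting past the end is skipped
        resolved.append((start, end))
    return resolved
```

With an entity-body of length 10000, resolving "bytes=-500" and
"bytes=9500-" both give the single range (9500, 9999), matching the
examples above.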

14.36.2 Range Retrieval Requests 

HTTP retrieval requests using conditional or unconditional GET 
methods may request one or more sub-ranges of the entity, instead of 
the entire entity, using the Range request header, which applies to 
the entity returned as the result of the request: 

Range = "Range" ":" ranges-specifier 

A server MAY ignore the Range header. However, HTTP/1.1 origin 
servers and intermediate caches SHOULD support byte ranges when 
possible, since Range supports efficient recovery from partially 
failed transfers, and supports efficient partial retrieval of large 
entities. 

If the server supports the Range header and the specified range or 
ranges are appropriate for the entity: 

o The presence of a Range header in an unconditional GET modifies 
what is returned if the GET is otherwise successful. In other 
words, the response carries a status code of 206 (Partial 
Content) instead of 200 (OK). 

o The presence of a Range header in a conditional GET (a request
using one or both of If-Modified-Since and If-None-Match, or
one or both of If-Unmodified-Since and If-Match) modifies what
is returned if the GET is otherwise successful and the condition
is true. It does not affect the 304 (Not Modified) response
returned if the conditional is false.

In some cases, it may be more appropriate to use the If-Range header 
(see section 14.27) in addition to the Range header. 

If a proxy that supports ranges receives a Range request, forwards 
the request to an inbound server, and receives an entire entity in 
reply, it SHOULD only return the requested range to its client. It 
SHOULD store the entire received response in its cache, if that is 
consistent with its cache allocation policies.

14.37 Referer

The Referer [sic] request-header field allows the client to specify,
for the server's benefit, the address (URI) of the resource from
which the Request-URI was obtained (the "referrer", although the
header field is misspelled.) The Referer request-header allows a
server to generate lists of back-links to resources for interest, 
logging, optimized caching, etc. It also allows obsolete or mistyped 
links to be traced for maintenance. The Referer field MUST NOT be 
sent if the Request-URI was obtained from a source that does not have 
its own URI, such as input from the user keyboard. 

Referer = "Referer" ":" ( absoluteURI | relativeURI )

Example:

Referer: http://www.w3.org/hypertext/DataSources/Overview.html

If the field value is a partial URI, it SHOULD be interpreted 
relative to the Request-URI. The URI MUST NOT include a fragment. 

Note: Because the source of a link may be private information or 
may reveal an otherwise private information source, it is strongly 
recommended that the user be able to select whether or not the 
Referer field is sent. For example, a browser client could have a 
toggle switch for browsing openly/anonymously, which would 
respectively enable/disable the sending of Referer and From 
information. 

14.38 Retry-After 

The Retry-After response-header field can be used with a 503 (Service 
Unavailable) response to indicate how long the service is expected to 
be unavailable to the requesting client. The value of this field can 
be either an HTTP-date or an integer number of seconds (in decimal) 
after the time of the response. 

Retry-After = "Retry-After" ":" ( HTTP-date | delta-seconds )

Two examples of its use are

Retry-After: Fri, 31 Dec 1999 23:59:59 GMT 
Retry-After: 120 

In the latter example, the delay is 2 minutes.

14.39 Server

The Server response-header field contains information about the 
software used by the origin server to handle the request. The field 
can contain multiple product tokens (section 3.8) and comments 
identifying the server and any significant subproducts. The product 
tokens are listed in order of their significance for identifying the 
application. 

Server = "Server" ":" 1*( product | comment )

Example:

Server: CERN/3.0 libwww/2.17

If the response is being forwarded through a proxy, the proxy 
application MUST NOT modify the Server response-header. Instead, it 
SHOULD include a Via field (as described in section 14.44). 

Note: Revealing the specific software version of the server may 
allow the server machine to become more vulnerable to attacks 
against software that is known to contain security holes. Server 
implementers are encouraged to make this field a configurable 
option. 

14.40 Transfer-Encoding 

The Transfer-Encoding general-header field indicates what (if any)
type of transformation has been applied to the message body in order
to safely transfer it between the sender and the recipient. This
differs from the Content-Encoding in that the transfer coding is a
property of the message, not of the entity.

Transfer-Encoding = "Transfer-Encoding" ":" 1#transfer-coding

Transfer codings are defined in section 3.6. An example is: 

Transfer-Encoding: chunked 

Many older HTTP/1.0 applications do not understand the Transfer- 
Encoding header. 

14.41 Upgrade 

The Upgrade general-header allows the client to specify what
additional communication protocols it supports and would like to use
if the server finds it appropriate to switch protocols. The server
MUST use the Upgrade header field within a 101 (Switching Protocols)
response to indicate which protocol(s) are being switched.

Upgrade = "Upgrade" ":" 1#product

For example,

Upgrade: HTTP/2.0, SHTTP/1.3, IRC/6.9, RTA/x11

The Upgrade header field is intended to provide a simple mechanism 
for transition from HTTP/1.1 to some other, incompatible protocol. It
does so by allowing the client to advertise its desire to use another 
protocol, such as a later version of HTTP with a higher major version 
number, even though the current request has been made using HTTP/1.1. 
This eases the difficult transition between incompatible protocols by 
allowing the client to initiate a request in the more commonly 
supported protocol while indicating to the server that it would like 
to use a "better" protocol if available (where "better" is determined 
by the server, possibly according to the nature of the method and/or 
resource being requested). 

The Upgrade header field only applies to switching application-layer
protocols upon the existing transport-layer connection. Upgrade
cannot be used to insist on a protocol change; its acceptance and use
by the server is optional. The capabilities and nature of the
application-layer communication after the protocol change is entirely
dependent upon the new protocol chosen, although the first action 
after changing the protocol MUST be a response to the initial HTTP 
request containing the Upgrade header field. 

The Upgrade header field only applies to the immediate connection. 
Therefore, the upgrade keyword MUST be supplied within a Connection 
header field (section 14.10) whenever Upgrade is present in an 
HTTP/1.1 message.
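
The pairing of Upgrade with the Connection header can be sketched in
Python as follows. The function name upgrade_protocols is
hypothetical, and treating an Upgrade header that lacks the matching
Connection token as if it were absent is an assumption of this
sketch, not a rule stated above.

```python
def upgrade_protocols(headers):
    """Return the protocols listed in Upgrade, or [] when the header
    is missing or not accompanied by "Connection: upgrade"."""
    # Field names are case-insensitive, so compare in lower case.
    lowered = {name.lower(): value for name, value in headers.items()}
    tokens = [t.strip().lower()
              for t in lowered.get("connection", "").split(",")]
    if "upgrade" not in tokens:
        return []
    return [p.strip()
            for p in lowered.get("upgrade", "").split(",")
            if p.strip()]
```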

The Upgrade header field cannot be used to indicate a switch to a 
protocol on a different connection. For that purpose, it is more 
appropriate to use a 301, 302, 303, or 305 redirection response. 

This specification only defines the protocol name "HTTP" for use by 
the family of Hypertext Transfer Protocols, as defined by the HTTP 
version rules of section 3.1 and future updates to this 
specification. Any token can be used as a protocol name; however, it 
will only be useful if both the client and server associate the name 
with the same protocol. 

14.42 User-Agent

The User-Agent request-header field contains information about the
user agent originating the request. This is for statistical purposes, 
the tracing of protocol violations, and automated recognition of user 
agents for the sake of tailoring responses to avoid particular user 
agent limitations. User agents SHOULD include this field with 
requests. The field can contain multiple product tokens (section 3.8) 
and comments identifying the agent and any subproducts which form a 
significant part of the user agent. By convention, the product tokens 
are listed in order of their significance for identifying the 
application. 

User-Agent = "User-Agent" ":" 1*( product | comment )

Example:

User-Agent: CERN-LineMode/2.15 libwww/2.17b3
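
A receiver that wants to inspect the product tokens and comments of
a User-Agent (or Server) value might split it as in the Python
sketch below. The regular expression is a simplification of the
product and comment rules: nested comments and quoted pairs are not
handled, and the name parse_products is hypothetical.

```python
import re

# Either a "(comment)" without nesting, or a product token with an
# optional "/version" part.
_ITEM = re.compile(r"\(([^()]*)\)|([^\s()/]+)(?:/([^\s()]+))?")


def parse_products(value):
    """Return the items of a User-Agent or Server value in order."""
    items = []
    for match in _ITEM.finditer(value):
        if match.group(1) is not None:
            items.append(("comment", match.group(1)))
        else:
            # The version is None when the product has no "/version".
            items.append(("product", match.group(2), match.group(3)))
    return items
```

Because the tokens are listed in order of significance, the first
"product" item is usually the one of interest.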

14.43 Vary 

The Vary response-header field is used by a server to signal that the 
response entity was selected from the available representations of 
the response using server-driven negotiation (section 12). Field- 
names listed in Vary headers are those of request-headers. The Vary 
field value indicates either that the given set of header fields 
encompass the dimensions over which the representation might vary, or 
that the dimensions of variance are unspecified ("*") and thus may 
vary over any aspect of future requests. 

Vary = "Vary" ":" ( "*" | 1#field-name )

An HTTP/1.1 server MUST include an appropriate Vary header field with 
any cachable response that is subject to server-driven negotiation. 
Doing so allows a cache to properly interpret future requests on that 
resource and informs the user agent about the presence of negotiation 
on that resource. A server SHOULD include an appropriate Vary header 
field with a non-cachable response that is subject to server-driven 
negotiation, since this might provide the user agent with useful 
information about the dimensions over which the response might vary. 

The set of header fields named by the Vary field value is known as 
the "selecting" request-headers. 

When the cache receives a subsequent request whose Request-URI 
specifies one or more cache entries including a Vary header, the 
cache MUST NOT use such a cache entry to construct a response to the 
new request unless all of the headers named in the cached Vary header 
are present in the new request, and all of the stored selecting
request-headers from the previous request match the corresponding 
headers in the new request. 

The selecting request-headers from two requests are defined to match
if and only if the selecting request-headers in the first request can
be transformed to the selecting request-headers in the second request
by adding or removing linear whitespace (LWS) at places where this is 
allowed by the corresponding BNF, and/or combining multiple message- 
header fields with the same field name following the rules about 
message headers in section 4.2. 

A Vary field value of "*" signals that unspecified parameters, 
possibly other than the contents of request-header fields (e.g., the
network address of the client), play a role in the selection of the 
response representation. Subsequent requests on that resource can 
only be properly interpreted by the origin server, and thus a cache 
MUST forward a (possibly conditional) request even when it has a 
fresh response cached for the resource. See section 13.6 for use of 
the Vary header by caches. 
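
The cache-side check described in the preceding paragraphs could be
sketched in Python as follows. Requests are modelled as dictionaries
mapping lower-case field names to lists of field values; that
layout, and the names used, are assumptions of this example. The
normalization only approximates "adding or removing LWS where
allowed by the BNF": it strips whitespace around list elements and
does not understand quoted strings.

```python
def _normalize(values):
    """Combine repeated fields and drop optional whitespace."""
    elements = []
    for value in values:
        for element in value.split(","):
            element = " ".join(element.split())
            if element:
                elements.append(element)
    return elements


def selecting_headers_match(vary, stored_request, new_request):
    """Return True if a cached entry may be used for new_request."""
    if vary.strip() == "*":
        # Only the origin server can interpret the request.
        return False
    for name in (n.strip().lower() for n in vary.split(",")):
        if not name:
            continue
        if name not in new_request:
            # Every header named in Vary must be present.
            return False
        stored = _normalize(stored_request.get(name, []))
        if stored != _normalize(new_request[name]):
            return False
    return True
```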

A Vary field value consisting of a list of field-names signals that 
the representation selected for the response is based on a selection 
algorithm which considers ONLY the listed request-header field values
in selecting the most appropriate representation. A cache MAY assume 
that the same selection will be made for future requests with the 
same values for the listed field names, for the duration of time in 
which the response is fresh. 

The field-names given are not limited to the set of standard 
request-header fields defined by this specification. Field names are
case-insensitive.

14.44 Via 

The Via general-header field MUST be used by gateways and proxies to
indicate the intermediate protocols and recipients between the user 
agent and the server on requests, and between the origin server and 
the client on responses. It is analogous to the "Received" field of 
RFC 822 and is intended to be used for tracking message forwards, 
avoiding request loops, and identifying the protocol capabilities of 
all senders along the request/response chain. 

Via = "Via" ":" 1#( received-protocol received-by [ comment ] )

received-protocol = [ protocol-name "/" ] protocol-version
protocol-name = token
protocol-version = token
received-by = ( host [ ":" port ] ) | pseudonym
pseudonym = token

The received-protocol indicates the protocol version of the message 
received by the server or client along each segment of the 
request/response chain. The received-protocol version is appended to
the Via field value when the message is forwarded so that information 
about the protocol capabilities of upstream applications remains 
visible to all recipients. 

The protocol-name is optional if and only if it would be "HTTP". The 
received-by field is normally the host and optional port number of a 
recipient server or client that subsequently forwarded the message. 
However, if the real host is considered to be sensitive information, 
it MAY be replaced by a pseudonym. If the port is not given, it MAY 
be assumed to be the default port of the received-protocol. 

Multiple Via field values represent each proxy or gateway that has 
forwarded the message. Each recipient MUST append its information 
such that the end result is ordered according to the sequence of 
forwarding applications. 

Comments MAY be used in the Via header field to identify the software 
of the recipient proxy or gateway, analogous to the User-Agent and 
Server header fields. However, all comments in the Via field are 
optional and MAY be removed by any recipient prior to forwarding the 
message. 

For example, a request message could be sent from an HTTP/1.0 user 
agent to an internal proxy code-named "fred", which uses HTTP/1.1 to
forward the request to a public proxy at nowhere.com, which completes 
the request by forwarding it to the origin server at www.ics.uci.edu. 
The request received by www.ics.uci.edu would then have the following 
Via header field: 

Via: 1.0 fred, 1.1 nowhere.com (Apache/1.1) 
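
The way each forwarding application appends its own entry can be
sketched in Python as below. The function name append_via and its
parameters are hypothetical; the sketch reproduces the example
field value shown above.

```python
def append_via(existing, version, received_by,
               protocol_name="HTTP", comment=None):
    """Return the Via field value after appending one entry.

    `existing` is the Via value of the received message, or None.
    """
    # The protocol-name is omitted if and only if it is "HTTP".
    if protocol_name == "HTTP":
        received_protocol = version
    else:
        received_protocol = f"{protocol_name}/{version}"
    entry = f"{received_protocol} {received_by}"
    if comment:
        entry += f" ({comment})"
    # Entries stay in forwarding order: the newest goes last.
    return f"{existing}, {entry}" if existing else entry
```

For the scenario above, "fred" would call append_via(None, "1.0",
"fred") and nowhere.com would append its own entry to the result.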

Proxies and gateways used as a portal through a network firewall 
SHOULD NOT, by default, forward the names and ports of hosts within 
the firewall region. This information SHOULD only be propagated if 
explicitly enabled. If not enabled, the received-by host of any host 
behind the firewall SHOULD be replaced by an appropriate pseudonym 
for that host. 

For organizations that have strong privacy requirements for hiding
internal structures, a proxy MAY combine an ordered subsequence of 
Via header field entries with identical received-protocol values into 
a single such entry. For example, 

Via: 1.0 ricky, 1.1 ethel, 1.1 fred, 1.0 lucy 

could be collapsed to 

Via: 1.0 ricky, 1.1 mertz, 1.0 lucy 

Applications SHOULD NOT combine multiple entries unless they are all 
under the same organizational control and the hosts have already been 
replaced by pseudonyms. Applications MUST NOT combine entries which 
have different received-protocol values. 
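
A sketch of this collapsing step in Python follows. It assumes a Via
value without comments, identifies the run to be merged by the
positions of its first and last entries, and uses the hypothetical
name collapse_via.

```python
def collapse_via(value, first, last, pseudonym):
    """Merge entries first..last (zero-based, inclusive) into one."""
    entries = [tuple(entry.split()[:2]) for entry in value.split(",")]
    run = entries[first:last + 1]
    if not run:
        raise ValueError("no entries selected")
    if len({protocol for protocol, _ in run}) != 1:
        # Entries with different received-protocol values
        # MUST NOT be combined.
        raise ValueError("received-protocol values differ")
    merged = entries[:first] + [(run[0][0], pseudonym)] + entries[last + 1:]
    return ", ".join(f"{protocol} {host}" for protocol, host in merged)
```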

14.45 Warning 

The Warning response-header field is used to carry additional 
information about the status of a response which may not be reflected
by the response status code. This information is typically, though 
not exclusively, used to warn about a possible lack of semantic 
transparency from caching operations. 

Warning headers are sent with responses using: 

Warning = "Warning" ":" 1#warning-value

warning-value = warn-code SP warn-agent SP warn-text
warn-code = 2DIGIT
warn-agent = ( host [ ":" port ] ) | pseudonym
             ; the name or pseudonym of the server adding
             ; the Warning header, for use in debugging
warn-text = quoted-string

A response may carry more than one Warning header. 
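
As a rough illustration, the warning-value grammar above could be
parsed in Python as follows. The name parse_warnings is
hypothetical, and the sketch leaves quoted pairs inside the
warn-text undecoded.

```python
import re

# warn-code SP warn-agent SP warn-text, followed by "," or the end.
_WARNING = re.compile(
    r'\s*(\d{2}) (\S+) "((?:[^"\\]|\\.)*)"\s*(?:,|$)')


def parse_warnings(value):
    """Return a list of (warn_code, warn_agent, warn_text) tuples."""
    warnings, pos = [], 0
    while pos < len(value):
        match = _WARNING.match(value, pos)
        if match is None:
            raise ValueError(f"malformed warning-value at offset {pos}")
        warnings.append(
            (int(match.group(1)), match.group(2), match.group(3)))
        pos = match.end()
    return warnings
```

Matching the quoted-string as a unit keeps a comma inside the
warn-text from being mistaken for a list separator.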

The warn-text should be in a natural language and character set that 
is most likely to be intelligible to the human user receiving the 
response. This decision may be based on any available knowledge, 
such as the location of the cache or user, the Accept-Language field
in a request, the Content-Language field in a response, etc. The 
default language is English and the default character set is ISO- 
8859-1. 

If a character set other than ISO-8859-1 is used, it MUST be encoded
in the warn-text using the method described in RFC 1522 [14]. 

Any server or cache may add Warning headers to a response. New
Warning headers should be added after any existing Warning headers. A 
cache MUST NOT delete any Warning header that it received with a 
response. However, if a cache successfully validates a cache entry, 
it SHOULD remove any Warning headers previously attached to that 
entry except as specified for specific Warning codes. It MUST then 
add any Warning headers received in the validating response. In other 
words, Warning headers are those that would be attached to the most 
recent relevant response. 

When multiple Warning headers are attached to a response, the user 
agent SHOULD display as many of them as possible, in the order that 
they appear in the response. If it is not possible to display all of 
the warnings, the user agent should follow these heuristics: 

o Warnings that appear early in the response take priority over those
appearing later in the response.

o Warnings in the user's preferred character set take priority over
warnings in other character sets but with identical warn-codes and
warn-agents.

Systems that generate multiple Warning headers should order them with 
this user agent behavior in mind. 

This is a list of the currently-defined warn-codes, each with a 
recommended warn-text in English, and a description of its meaning. 

10 Response is stale 

MUST be included whenever the returned response is stale. A cache may 
add this warning to any response, but may never remove it until the 
response is known to be fresh. 

11 Revalidation failed 

MUST be included if a cache returns a stale response because an 
attempt to revalidate the response failed, due to an inability to 
reach the server. A cache may add this warning to any response, but 
may never remove it until the response is successfully revalidated. 

12 Disconnected operation 

SHOULD be included if the cache is intentionally disconnected from 
the rest of the network for a period of time. 

13 Heuristic expiration 

MUST be included if the cache heuristically chose a freshness
lifetime greater than 24 hours and the response's age is greater than 
24 hours. 

14 Transformation applied

MUST be added by an intermediate cache or proxy if it applies any 
transformation changing the content-coding (as specified in the 
Content-Encoding header) or media-type (as specified in the 
Content-Type header) of the response, unless this Warning code 
already appears in the response. MUST NOT be deleted from a response 
even after revalidation. 

99 Miscellaneous warning 

The warning text may include arbitrary information to be presented to 
a human user, or logged. A system receiving this warning MUST NOT 
take any automated action. 
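
How a cache might decide which of the freshness-related warn-codes
apply to a response it is about to return is sketched below in
Python. The function name and parameters are hypothetical, times are
in seconds, and the sketch assumes that a response is stale once its
age reaches its freshness lifetime.

```python
DAY = 24 * 60 * 60  # 24 hours, in seconds


def required_warn_codes(age, freshness_lifetime, heuristic,
                        revalidation_failed=False):
    """Return the warn-codes 10, 11 and 13 that apply, in order."""
    codes = []
    if age >= freshness_lifetime:
        codes.append(10)          # Response is stale
        if revalidation_failed:
            codes.append(11)      # Revalidation failed
    if heuristic and freshness_lifetime > DAY and age > DAY:
        codes.append(13)          # Heuristic expiration
    return codes
```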

14.46 WWW-Authenticate 

The WWW-Authenticate response-header field MUST be included in 401 
(Unauthorized) response messages. The field value consists of at 
least one challenge that indicates the authentication scheme(s) and 
parameters applicable to the Request-URI. 

WWW-Authenticate = "WWW-Authenticate" ":" 1#challenge

The HTTP access authentication process is described in section 11. 
User agents MUST take special care in parsing the WWW-Authenticate 
field value if it contains more than one challenge, or if more than 
one WWW-Authenticate header field is provided, since the contents of 
a challenge may itself contain a comma-separated list of 
authentication parameters. 
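
The difficulty is that a comma may separate two challenges, separate
two parameters of one challenge, or appear inside a quoted string.
One possible approach is sketched below in Python. It assumes that
every challenge begins with "scheme SP parameter" and that no
whitespace surrounds "="; the function names are hypothetical and
quoted pairs are not decoded.

```python
def _split_commas(value):
    """Split on commas that are outside quoted strings."""
    items, current = [], []
    in_quotes = escaped = False
    for ch in value:
        if escaped:
            current.append(ch)
            escaped = False
        elif ch == "\\" and in_quotes:
            current.append(ch)
            escaped = True
        elif ch == "," and not in_quotes:
            items.append("".join(current).strip())
            current = []
        else:
            if ch == '"':
                in_quotes = not in_quotes
            current.append(ch)
    items.append("".join(current).strip())
    return [item for item in items if item]


def parse_challenges(value):
    """Return a list of (scheme, parameters) pairs."""
    challenges = []
    for item in _split_commas(value):
        head, _, rest = item.partition(" ")
        if rest and "=" not in head:
            # "scheme name=value" starts a new challenge.
            challenges.append((head, {}))
            item = rest.strip()
        if not challenges:
            raise ValueError("parameter found before any scheme")
        name, _, val = item.partition("=")
        challenges[-1][1][name.strip().lower()] = val.strip().strip('"')
    return challenges
```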

15 Security Considerations 

This section is meant to inform application developers, information 
providers, and users of the security limitations in HTTP/1.1 as 
described by this document. The discussion does not include 
definitive solutions to the problems revealed, though it does make 
some suggestions for reducing security risks. 

15.1 Authentication of Clients 

The Basic authentication scheme is not a secure method of user 
authentication, nor does it in any way protect the entity, which is 
transmitted in clear text across the physical network used as the 
carrier. HTTP does not prevent additional authentication schemes and 
encryption mechanisms from being employed to increase security or the 
addition of enhancements (such as schemes to use one-time passwords) 
to Basic authentication. 

The most serious flaw in Basic authentication is that it results in
the essentially clear text transmission of the user's password over 
the physical network. It is this problem which Digest Authentication 
attempts to address. 

Because Basic authentication involves the clear text transmission of 
passwords it SHOULD never be used (without enhancements) to protect 
sensitive or valuable information. 

A common use of Basic authentication is for identification purposes 
-- requiring the user to provide a user name and password as a means
of identification, for example, for purposes of gathering accurate 
usage statistics on a server. When used in this way it is tempting to 
think that there is no danger in its use if illicit access to the 
protected documents is not a major concern. This is only correct if 
the server issues both user name and password to the users and in 
particular does not allow the user to choose his or her own password. 
The danger arises because naive users frequently reuse a single 
password to avoid the task of maintaining multiple passwords. 

If a server permits users to select their own passwords, then the
threat is not only illicit access to documents on the server but also 
illicit access to the accounts of all users who have chosen to use 
their account password. If users are allowed to choose their own 
password that also means the server must maintain files containing 
the (presumably encrypted) passwords. Many of these may be the 
account passwords of users perhaps at distant sites. The owner or 
administrator of such a system could conceivably incur liability if 
this information is not maintained in a secure fashion. 

Basic Authentication is also vulnerable to spoofing by counterfeit 
servers. If a user can be led to believe that he is connecting to a 
host containing information protected by basic authentication when in 
fact he is connecting to a hostile server or gateway then the 
attacker can request a password, store it for later use, and feign an 
error. This type of attack is not possible with Digest Authentication 
[32]. Server implementers SHOULD guard against the possibility of
this sort of counterfeiting by gateways or CGI scripts. In particular 
it is very dangerous for a server to simply turn over a connection to 
a gateway since that gateway can then use the persistent connection 
mechanism to engage in multiple transactions with the client while 
impersonating the original server in a way that is not detectable by 
the client. 

15.2 Offering a Choice of Authentication Schemes 

An HTTP/1.1 server may return multiple challenges with a 401 
(Authenticate) response, and each challenge may use a different 
scheme. The order of the challenges returned to the user agent is in
the order that the server would prefer they be chosen. The server 
should order its challenges with the "most secure" authentication 
scheme first. A user agent should choose as the challenge to be made 
to the user the first one that the user agent understands. 

When the server offers choices of authentication schemes using the
WWW-Authenticate header, the "security" of the authentication is only
as good as the security of the weakest of the authentication schemes.
A malicious user could capture the set of challenges and try to
authenticate him/herself using the weakest of the authentication
schemes. Thus, the ordering serves more to protect the user's
credentials than the server's information.

A possible man-in-the-middle (MITM) attack would be to add a weak
authentication scheme to the set of choices, hoping that the client 
will use one that exposes the user's credentials (e.g. password). For 
this reason, the client should always use the strongest scheme that 
it understands from the choices accepted. 

An even better MITM attack would be to remove all offered choices, 
and to insert a challenge that requests Basic authentication. For 
this reason, user agents that are concerned about this kind of attack 
could remember the strongest authentication scheme ever requested by 
a server and produce a warning message that requires user 
confirmation before using a weaker one. A particularly insidious way 
to mount such a MITM attack would be to offer a "free" proxy caching 
service to gullible users. 
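
The two client-side defenses described above (preferring the
strongest understood scheme, and noticing a downgrade relative to
what a server requested earlier) might look like the following
Python sketch. The STRENGTH ranking and all names are assumptions
made for illustration; a real user agent would rank the schemes it
actually implements.

```python
# Assumed ranking of the schemes this client understands.
STRENGTH = {"basic": 0, "digest": 1}


def choose_scheme(offered, remembered=None):
    """Return (scheme, downgraded) for the offered scheme names.

    `remembered` is the strongest scheme this server has requested
    before, if any; it must be a key of STRENGTH.
    """
    known = [s.lower() for s in offered if s.lower() in STRENGTH]
    if not known:
        return None, False
    best = max(known, key=STRENGTH.__getitem__)
    downgraded = (remembered is not None
                  and STRENGTH[best] < STRENGTH[remembered.lower()])
    return best, downgraded
```

A True value for "downgraded" is the point at which the user agent
would ask the user for confirmation before proceeding.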

15.3 Abuse of Server Log Information 

A server is in the position to save personal data about a user's 
requests which may identify their reading patterns or subjects of 
interest. This information is clearly confidential in nature and its 
handling may be constrained by law in certain countries. People using 
the HTTP protocol to provide data are responsible for ensuring that 
such material is not distributed without the permission of any 
individuals that are identifiable by the published results. 

15.4 Transfer of Sensitive Information 

Like any generic data transfer protocol, HTTP cannot regulate the 
content of the data that is transferred, nor is there any a priori 
method of determining the sensitivity of any particular piece of 
information within the context of any given request. Therefore, 
applications SHOULD supply as much control over this information as 
possible to the provider of that information. Four header fields are 
worth special mention in this context: Server, Via, Referer and From. 

Revealing the specific software version of the server may allow the
server machine to become more vulnerable to attacks against software 
that is known to contain security holes. Implementers SHOULD make the 
Server header field a configurable option. 

Proxies which serve as a portal through a network firewall SHOULD
take special precautions regarding the transfer of header information 
that identifies the hosts behind the firewall. In particular, they 
SHOULD remove, or replace with sanitized versions, any Via fields 
generated behind the firewall. 

The Referer field allows reading patterns to be studied and reverse 
links drawn. Although it can be very useful, its power can be abused 
if user details are not separated from the information contained in 
the Referer. Even when the personal information has been removed, the 
Referer field may indicate a private document's URI whose publication 
would be inappropriate. 

The information sent in the From field might conflict with the user's 
privacy interests or their site's security policy, and hence it 
SHOULD NOT be transmitted without the user being able to disable, 
enable, and modify the contents of the field. The user MUST be able 
to set the contents of this field within a user preference or 
application defaults configuration. 

We suggest, though do not require, that a convenient toggle interface
be provided for the user to enable or disable the sending of From and 
Referer information. 

15.5 Attacks Based On File and Path Names 

Implementations of HTTP origin servers SHOULD be careful to restrict 
the documents returned by HTTP requests to be only those that were 
intended by the server administrators. If an HTTP server translates 
HTTP URIs directly into file system calls, the server MUST take 
special care not to serve files that were not intended to be 
delivered to HTTP clients. For example, UNIX, Microsoft Windows, and
other operating systems use ".." as a path component to indicate a
directory level above the current one. On such a system, an HTTP
server MUST disallow any such construct in the Request-URI if it
would otherwise allow access to a resource outside those intended to
be accessible via the HTTP server. Similarly, files intended for
reference only internally to the server (such as access control 
files, configuration files, and script code) MUST be protected from 
inappropriate retrieval, since they might contain sensitive 
information. Experience has shown that minor bugs in such HTTP server 
implementations have turned into security risks. 
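
A conservative way for a server to map a Request-URI onto the file
system is sketched below in Python. The document root "/srv/www"
and the function name are hypothetical. The sketch rejects every
".." component, which is stricter than the requirement above (it
also refuses paths that would have stayed inside the root).

```python
import posixpath
from urllib.parse import unquote, urlsplit


def resolve_request_path(request_uri, document_root="/srv/www"):
    """Map a Request-URI to a path below document_root."""
    # Decode %-escapes first so that "%2e%2e" cannot hide a "..".
    path = unquote(urlsplit(request_uri).path)
    if "\\" in path or "\x00" in path:
        raise PermissionError("suspicious character in path")
    parts = [p for p in path.split("/") if p not in ("", ".")]
    if ".." in parts:
        raise PermissionError("'..' path component rejected")
    return posixpath.join(document_root, *parts)
```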

15.6 Personal Information

HTTP clients are often privy to large amounts of personal information 
(e.g. the user's name, location, mail address, passwords, encryption 
keys, etc.), and SHOULD be very careful to prevent unintentional 
leakage of this information via the HTTP protocol to other sources. 
We very strongly recommend that a convenient interface be provided 
for the user to control dissemination of such information, and that 
designers and implementers be particularly careful in this area. 
History shows that errors in this area are often both serious 
security and/or privacy problems, and often generate highly adverse 
publicity for the implementer's company.

15.7 Privacy Issues Connected to Accept Headers 

Accept request-headers can reveal information about the user to all 
servers which are accessed. The Accept-Language header in particular
can reveal information the user would consider to be of a private 
nature, because the understanding of particular languages is often 
strongly correlated to the membership of a particular ethnic group. 
User agents which offer the option to configure the contents of an 
Accept-Language header to be sent in every request are strongly
encouraged to let the configuration process include a message which 
makes the user aware of the loss of privacy involved. 

An approach that limits the loss of privacy would be for a user agent 
to omit the sending of Accept-Language headers by default, and to ask
the user whether it should start sending Accept-Language headers to a 
server if it detects, by looking for any Vary response-header fields 
generated by the server, that such sending could improve the quality 
of service. 

Elaborate user-customized accept header fields sent in every request, 
in particular if these include quality values, can be used by servers 
as relatively reliable and long-lived user identifiers. Such user 
identifiers would allow content providers to do click-trail tracking, 
and would allow collaborating content providers to match cross-server 
click-trails or form submissions of individual users. Note that for 
many users not behind a proxy, the network address of the host 
running the user agent will also serve as a long-lived user 
identifier. In environments where proxies are used to enhance 
privacy, user agents should be conservative in offering accept header 
configuration options to end users. As an extreme privacy measure, 
proxies could filter the accept headers in relayed requests. General 
purpose user agents which provide a high degree of header 
configurability should warn users about the loss of privacy which can 
be involved. 

15.8 DNS Spoofing

Clients using HTTP rely heavily on the Domain Name Service, and are 
thus generally prone to security attacks based on the deliberate 
mis-association of IP addresses and DNS names. Clients need to be 
cautious in assuming the continuing validity of an IP number/DNS name 
association. 

In particular, HTTP clients SHOULD rely on their name resolver for 
confirmation of an IP number/DNS name association, rather than 
caching the result of previous host name lookups. Many platforms 
already can cache host name lookups locally when appropriate, and 
they SHOULD be configured to do so. These lookups should be cached, 
however, only when the TTL (Time To Live) information reported by the 
name server makes it likely that the cached information will remain 
useful. 

If HTTP clients cache the results of host name lookups in order to 
achieve a performance improvement, they MUST observe the TTL 
information reported by DNS. 
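
A client-side cache that observes the TTL could be structured as in
the Python sketch below. The resolver is passed in as a callable
returning an (address, ttl_seconds) pair; that interface and the
class name DnsCache are assumptions of this example, since standard
resolver interfaces often do not expose the TTL.

```python
import time


class DnsCache:
    """Cache host name lookups for no longer than their TTL."""

    def __init__(self, resolve, clock=time.monotonic):
        self._resolve = resolve   # name -> (address, ttl_seconds)
        self._clock = clock
        self._entries = {}        # name -> (address, expiry time)

    def lookup(self, name):
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]
        # Missing or expired entry: ask the resolver again.
        address, ttl = self._resolve(name)
        self._entries[name] = (address, now + ttl)
        return address
```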

If HTTP clients do not observe this rule, they could be spoofed when 
a previously-accessed server's IP address changes. As network 
renumbering is expected to become increasingly common, the 
possibility of this form of attack will grow. Observing this 
requirement thus reduces this potential security vulnerability.

This requirement also improves the load-balancing behavior of clients 
for replicated servers using the same DNS name and reduces the 
likelihood of a user's experiencing failure in accessing sites which 
use that strategy. 

15.9 Location Headers and Spoofing 

If a single server supports multiple organizations that do not trust 
one another, then it must check the values of Location and Content- 
Location headers in responses that are generated under control of 
said organizations to make sure that they do not attempt to 
invalidate resources over which they have no authority. 

19 Appendices

19.1 Internet Media Type message/http 

In addition to defining the HTTP/1.1 protocol, this document serves 
as the specification for the Internet media type "message/http". The 
following is to be registered with IANA. 

Media Type name: message
Media subtype name: http
Required parameters: none
Optional parameters: version, msgtype

version: The HTTP-Version number of the enclosed message
         (e.g., "1.1"). If not present, the version can be
         determined from the first line of the body.

msgtype: The message type -- "request" or "response". If not
         present, the type can be determined from the first
         line of the body.

Encoding considerations: only "7bit", "8bit", or "binary" are
                         permitted

Security considerations: none

19.2 Internet Media Type multipart/byteranges

When an HTTP message includes the content of multiple ranges (for 
example, a response to a request for multiple non-overlapping 
ranges), these are transmitted as a multipart MIME message. The 
multipart media type for this purpose is called
"multipart/byteranges".

The multipart/byteranges media type includes two or more parts, each
with its own Content-Type and Content-Range fields. The parts are 
separated using a MIME boundary parameter. 

Media Type name: multipart 
Media subtype name: byteranges 
Required parameters: boundary 
Optional parameters: none 

Encoding considerations: only "7bit", "8bit", or "binary" are
                         permitted

Security considerations: none 

For example:

HTTP/1.1 206 Partial content
Date: Wed, 15 Nov 1995 06:25:24 GMT
Last-modified: Wed, 15 Nov 1995 04:58:08 GMT
Content-type: multipart/byteranges; boundary=THIS_STRING_SEPARATES

--THIS_STRING_SEPARATES
Content-type: application/pdf
Content-range: bytes 500-999/8000

...the first range...
--THIS_STRING_SEPARATES
Content-type: application/pdf
Content-range: bytes 7000-7999/8000

...the second range...
--THIS_STRING_SEPARATES--
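
A body of this shape could be assembled as in the following Python
sketch. The function name build_byteranges is hypothetical; the
caller is assumed to pick a boundary string that does not occur in
the data and to pass ranges as inclusive (first, last) byte
positions.

```python
def build_byteranges(data, ranges, content_type, boundary):
    """Build a multipart/byteranges body for the given ranges."""
    total = len(data)
    delimiter = b"--" + boundary.encode("ascii")
    out = bytearray()
    for first, last in ranges:
        out += delimiter + b"\r\n"
        out += f"Content-type: {content_type}\r\n".encode("ascii")
        out += (f"Content-range: bytes {first}-{last}/{total}"
                "\r\n\r\n").encode("ascii")
        # Byte positions are inclusive, hence last + 1.
        out += data[first:last + 1] + b"\r\n"
    out += delimiter + b"--\r\n"   # closing delimiter
    return bytes(out)
```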

19.3 Tolerant Applications

Although this document specifies the requirements for the generation 
of HTTP/1.1 messages, not all applications will be correct in their 
implementation. We therefore recommend that operational applications 
be tolerant of deviations whenever those deviations can be 
interpreted unambiguously. 

Clients SHOULD be tolerant in parsing the Status-Line and servers 
tolerant when parsing the Request-Line. In particular, they SHOULD 
accept any amount of SP or HT characters between fields, even though 
only a single SP is required. 

The line terminator for message-header fields is the sequence CRLF. 
However, we recommend that applications, when parsing such headers, 
recognize a single LF as a line terminator and ignore the leading CR. 
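
The two tolerance recommendations above can be sketched in Python as
follows. The function names are hypothetical, and the input is
assumed to be the raw bytes of the start line and header section.

```python
def split_lines(raw):
    """Split on LF and drop the CR that may precede it."""
    lines = raw.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()               # no empty line after the final LF
    return [line[:-1] if line.endswith(b"\r") else line
            for line in lines]


def parse_status_line(line):
    """Parse a Status-Line, accepting any run of SP or HT between
    fields."""
    # split(None, 2) treats consecutive whitespace as one separator.
    version, code, *reason = line.split(None, 2)
    phrase = reason[0].decode("latin-1") if reason else ""
    return version.decode("ascii"), int(code), phrase
```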

The character set of an entity-body should be labeled as the lowest 
common denominator of the character codes used within that body, with 
the exception that no label is preferred over the labels US-ASCII or 
ISO-8859-1.

Additional rules for requirements on parsing and encoding of dates 
and other potential problems with date encodings include: 

o HTTP/1.1 clients and caches should assume that an RFC-850 date
which appears to be more than 50 years in the future is in fact 
in the past (this helps solve the "year 2000" problem). 

o An HTTP/1.1 implementation may internally represent a parsed
Expires date as earlier than the proper value, but MUST NOT 
internally represent a parsed Expires date as later than the 
proper value. 

o All expiration-related calculations must be done in GMT. The 
local time zone MUST NOT influence the calculation or comparison 
of an age or expiration time. 

o If an HTTP header incorrectly carries a date value with a time 
zone other than GMT, it must be converted into GMT using the most 
conservative possible conversion. 
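
The first rule in this list, for the two-digit years of RFC-850
dates, might be applied as in the Python sketch below. The function
name is hypothetical, and the sketch assumes that the two-digit year
is first read as a year of the current century.

```python
def expand_two_digit_year(yy, current_year):
    """Expand a two-digit RFC-850 year to four digits."""
    century = current_year - current_year % 100
    year = century + yy
    if year > current_year + 50:
        # More than 50 years in the future: it is in fact in the past.
        year -= 100
    return year
```

For example, with the current year 2024 the value 99 is taken as
1999 rather than 2099, while 30 remains 2030.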

19.4 Differences Between HTTP Entities and MIME Entities 

HTTP/1.1 uses many of the constructs defined for Internet Mail (RFC 
822) and the Multipurpose Internet Mail Extensions (MIME) to allow
entities to be transmitted in an open variety of representations and 
with extensible mechanisms. However, MIME [7] discusses mail, and 
HTTP has a few features that are different from those described in 
MIME. These differences were carefully chosen to optimize
performance over binary connections, to allow greater freedom in the 
use of new media types, to make date comparisons easier, and to 
acknowledge the practice of some early HTTP servers and clients. 

This appendix describes specific areas where HTTP differs from MIME. 
Proxies and gateways to strict MIME environments SHOULD be aware of 
these differences and provide the appropriate conversions where 
necessary. Proxies and gateways from MIME environments to HTTP also 
need to be aware of the differences because some conversions may be 
required. 

19.4.1 Conversion to Canonical Form 

MIME requires that an Internet mail entity be converted to canonical 
form prior to being transferred. Section 3.7.1 of this document 
describes the forms allowed for subtypes of the "text" media type 
when transmitted over HTTP. MIME requires that content with a type of 
"text" represent line breaks as CRLF and forbids the use of CR or LF 
outside of line break sequences. HTTP allows CRLF, bare CR, and bare 
LF to indicate a line break within text content when a message is 
transmitted over HTTP. 

Where it is possible, a proxy or gateway from HTTP to a strict MIME 
environment SHOULD translate all line breaks within the text media 
types described in section 3.7.1 of this document to the MIME 
canonical form of CRLF. Note, however, that this may be complicated 
by the presence of a Content-Encoding and by the fact that HTTP
allows the use of some character sets which do not use octets 13 and
10 to represent CR and LF, as is the case for some multi-byte 
character sets. 
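
A gateway's line-break translation might be sketched in Python as follows. The function name is hypothetical, and the sketch assumes the entity-body has no Content-Encoding applied and uses a character set in which octets 13 and 10 represent CR and LF; neither assumption holds in general, as noted above.

```python
import re

# CRLF must be tried first so that it is not rewritten as two line breaks.
_LINE_BREAK = re.compile(rb"\r\n|\r|\n")


def to_mime_canonical(text_body: bytes) -> bytes:
    """Rewrite CRLF, bare CR and bare LF line breaks as CRLF."""
    return _LINE_BREAK.sub(b"\r\n", text_body)
```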

19.4.2 Conversion of Date Formats 

HTTP/1.1 uses a restricted set of date formats (section 3.3.1) to
simplify the process of date comparison. Proxies and gateways from 
other protocols SHOULD ensure that any Date header field present in a 
message conforms to one of the HTTP/1.1 formats and rewrite the date 
if necessary. 

19.4.3 Introduction of Content-Encoding

MIME does not include any concept equivalent to HTTP/1.1's Content-
Encoding header field. Since this acts as a modifier on the media
type, proxies and gateways from HTTP to MIME-compliant protocols MUST
either change the value of the Content-Type header field or decode
the entity-body before forwarding the message. (Some experimental
applications of Content-Type for Internet mail have used a media-type
parameter of ";conversions=<content-coding>" to perform an equivalent
function as Content-Encoding. However, this parameter is not part of
MIME.)

19.4.4 No Content-Transfer-Encoding 

HTTP does not use the Content-Transfer-Encoding (CTE) field of MIME.
Proxies and gateways from MIME-compliant protocols to HTTP MUST
remove any non-identity CTE ("quoted-printable" or "base64") encoding
prior to delivering the response message to an HTTP client.

Proxies and gateways from HTTP to MIME-compliant protocols are
responsible for ensuring that the message is in the correct format 
and encoding for safe transport on that protocol, where "safe 
transport" is defined by the limitations of the protocol being used. 
Such a proxy or gateway SHOULD label the data with an appropriate 
Content-Transfer-Encoding if doing so will improve the likelihood of 
safe transport over the destination protocol. 

19.4.5 HTTP Header Fields in Multipart Body-Parts 

In MIME, most header fields in multipart body-parts are generally 
ignored unless the field name begins with "Content-". In HTTP/1.1,
multipart body-parts may contain any HTTP header fields which are
significant to the meaning of that part.

19.4.6 Introduction of Transfer-Encoding

HTTP/1.1 introduces the Transfer-Encoding header field (section 
14.40). Proxies/gateways MUST remove any transfer coding prior to 
forwarding a message via a MIME-compliant protocol.

A process for decoding the "chunked" transfer coding (section 3.6) 
can be represented in pseudo-code as: 

length := 0
read chunk-size, chunk-ext (if any) and CRLF
while (chunk-size > 0) {
   read chunk-data and CRLF
   append chunk-data to entity-body
   length := length + chunk-size
   read chunk-size and CRLF
}
read entity-header
while (entity-header not empty) {
   append entity-header to existing header fields
   read entity-header
}
Content-Length := length
Remove "chunked" from Transfer-Encoding
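
The pseudo-code above can be rendered as a short runnable sketch. The Python below is illustrative only: the function name is hypothetical, it reads from any binary file-like object, it discards chunk extensions, and it returns trailing entity-headers as raw lines rather than merging them into the existing header fields.

```python
import io


def decode_chunked(stream) -> tuple:
    """Decode a "chunked" body; return (entity_body, trailing_header_lines)."""

    def read_line() -> bytes:
        line = stream.readline()
        if not line.endswith(b"\n"):
            raise ValueError("stream ended in the middle of a line")
        return line.rstrip(b"\r\n")

    body = bytearray()
    while True:
        # chunk-size is hexadecimal; anything after ";" is a chunk-ext.
        size_field = read_line().split(b";", 1)[0].strip()
        chunk_size = int(size_field.decode("ascii"), 16)
        if chunk_size == 0:
            break
        data = stream.read(chunk_size)
        if len(data) != chunk_size:
            raise ValueError("stream ended in the middle of chunk-data")
        body += data
        if read_line() != b"":
            raise ValueError("chunk-data was not followed by CRLF")

    trailers = []
    while True:
        header = read_line()
        if header == b"":
            break
        trailers.append(header)
    return bytes(body), trailers
```

The Content-Length of the decoded message is then the length of the returned body, and "chunked" can be removed from Transfer-Encoding.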

19.4.7 MIME-Version 

HTTP is not a MIME-compliant protocol (see appendix 19.4). However,
HTTP/1.1 messages may include a single MIME-Version general-header
field to indicate what version of the MIME protocol was used to 
construct the message. Use of the MIME-Version header field indicates 
that the message is in full compliance with the MIME protocol. 
Proxies/gateways are responsible for ensuring full compliance (where 
possible) when exporting HTTP messages to strict MIME environments.

MIME-Version = "MIME-Version" ":" 1*DIGIT "." 1*DIGIT

MIME version "1.0" is the default for use in HTTP/1.1. However, 
HTTP/1.1 message parsing and semantics are defined by this document 
and not the MIME specification. 

19.5 Changes from HTTP/1.0 

This section summarizes major differences between versions HTTP/1.0
and HTTP/1.1.

19.5.1 Changes to Simplify Multi-homed Web Servers and Conserve IP
Addresses 

The requirements that clients and servers support the Host request-
header, report an error if the Host request-header (section 14.23) is
missing from an HTTP/1.1 request, and accept absolute URIs (section 
5.1.2) are among the most important changes defined by this 
specification. 

Older HTTP/1.0 clients assumed a one-to-one relationship of IP 
addresses and servers; there was no other established mechanism for 
distinguishing the intended server of a request than the IP address 
to which that request was directed. The changes outlined above will 
allow the Internet, once older HTTP clients are no longer common, to 
support multiple Web sites from a single IP address, greatly 
simplifying large operational Web servers, where allocation of many 
IP addresses to a single host has created serious problems. The 
Internet will also be able to recover the IP addresses that have been 
allocated for the sole purpose of allowing special-purpose domain
names to be used in root-level HTTP URLs. Given the rate of growth of 
the Web, and the number of servers already deployed, it is extremely 
important that all implementations of HTTP (including updates to 
existing HTTP/1.0 applications) correctly implement these 
requirements: 

o Both clients and servers MUST support the Host request-header.

o Host request-headers are required in HTTP/1.1 requests.

o Servers MUST report a 400 (Bad Request) error if an HTTP/1.1
request does not include a Host request-header.

o Servers MUST accept absolute URIs. 
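
The third requirement in the list above might be checked by a server along the following lines. This Python sketch is illustrative only; the function name and the representation of the request as a version string plus a dictionary of header fields are assumptions made for the example.

```python
from typing import Mapping, Optional


def host_check_status(http_version: str, headers: Mapping[str, str]) -> Optional[int]:
    """Return 400 if an HTTP/1.1 request lacks a Host header, else None."""
    # Header field names are case-insensitive.
    names = {name.lower() for name in headers}
    if http_version == "HTTP/1.1" and "host" not in names:
        return 400  # Bad Request
    return None
```

An HTTP/1.0 request without a Host header is not rejected by this check, since the requirement applies to HTTP/1.1 requests.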

19.6 Additional Features

This appendix documents protocol elements used by some existing HTTP 
implementations, but not consistently and correctly across most 
HTTP/1.1 applications. Implementers should be aware of these 
features, but cannot rely upon their presence in, or interoperability 
with, other HTTP/1.1 applications. Some of these describe proposed 
experimental features, and some describe features that experimental 
deployment found lacking that are now addressed in the base HTTP/1.1 
specification. 

19.6.1 Additional Request Methods 

19.6.1.1 PATCH 

The PATCH method is similar to PUT except that the entity contains a 
list of differences between the original version of the resource 
identified by the Request-URI and the desired content of the resource 
after the PATCH action has been applied. The list of differences is 
in a format defined by the media type of the entity (e.g.,
"application/diff") and MUST include sufficient information to allow
the server to recreate the changes necessary to convert the original 
version of the resource to the desired version. 

If the request passes through a cache and the Request-URI identifies 
a currently cached entity, that entity MUST be removed from the 
cache. Responses to this method are not cachable. 

The actual method for determining how the patched resource is placed, 
and what happens to its predecessor, is defined entirely by the 
origin server. If the original version of the resource being patched 
included a Content-Version header field, the request entity MUST
include a Derived-From header field corresponding to the value of the
original Content-Version header field. Applications are encouraged to
use these fields for constructing versioning relationships and 
resolving version conflicts. 

PATCH requests must obey the message transmission requirements set 
out in section 8.2.

Caches that implement PATCH should invalidate cached responses as 
defined in section 13.10 for PUT. 

19.6.1.2 LINK 

The LINK method establishes one or more Link relationships between 
the existing resource identified by the Request-URI and other 
existing resources. The difference between LINK and other methods 
allowing links to be established between resources is that the LINK
method does not allow any message-body to be sent in the request and 
does not directly result in the creation of new resources. 

If the request passes through a cache and the Request-URI identifies 
a currently cached entity, that entity MUST be removed from the 
cache. Responses to this method are not cachable. 

Caches that implement LINK should invalidate cached responses as
defined in section 13.10 for PUT. 

19.6.1.3 UNLINK 

The UNLINK method removes one or more Link relationships from the 
existing resource identified by the Request-URI. These relationships 
may have been established using the LINK method or by any other 
method supporting the Link header. The removal of a link to a 
resource does not imply that the resource ceases to exist or becomes 
inaccessible for future references. 

If the request passes through a cache and the Request-URI identifies
a currently cached entity, that entity MUST be removed from the 
cache. Responses to this method are not cachable. 

Caches that implement UNLINK should invalidate cached responses as 
defined in section 13.10 for PUT. 

19.6.2 Additional Header Field Definitions 

19.6.2.1 Alternates

The Alternates response-header field has been proposed as a means for 
the origin server to inform the client about other available 
representations of the requested resource, along with their 
distinguishing attributes, and thus providing a more reliable means 
for a user agent to perform subsequent selection of another 
representation which better fits the desires of its user (described 
as agent-driven negotiation in section 12).

The Alternates header field is orthogonal to the Vary header field in
that both may coexist in a message without affecting the 
interpretation of the response or the available representations. It 
is expected that Alternates will provide a significant improvement 
over the server-driven negotiation provided by the Vary field for 
those resources that vary over common dimensions like type and 
language. 

The Alternates header field will be defined in a future 
specification. 

19.6.2.2 Content-Version 

The Content-Version entity-header field defines the version tag 
associated with a rendition of an evolving entity. Together with the 
Derived-From field described in section 19.6.2.3, it allows a group 
of people to work simultaneously on the creation of a work as an 
iterative process. The field should be used to allow evolution of a 
particular work along a single path rather than derived works or 
renditions in different representations. 

Content-Version = "Content-Version" ":" quoted-string

Examples of the Content-Version field include: 

Content-Version: "2.1.2" 
Content-Version: "Fred 19950116-12:26:48" 
Content-Version: "2.5a4-omega7"

19.6.2.3 Derived-From 

The Derived-From entity-header field can be used to indicate the 
version tag of the resource from which the enclosed entity was 
derived before modifications were made by the sender. This field is 
used to help manage the process of merging successive changes to a 
resource, particularly when such changes are being made in parallel 
and from multiple sources. 

Derived-From = "Derived-From" ":" quoted-string 

An example use of the field is: 

Derived-From: "2.1.1"

The Derived-From field is required for PUT and PATCH requests if the 
entity being sent was previously retrieved from the same URI and a 
Content-Version header was included with the entity when it was last 
retrieved.

19.6.2.4 Link

The Link entity-header field provides a means for describing a 
relationship between two resources, generally between the requested 
resource and some other resource. An entity MAY include multiple Link 
values. Links at the metainformation level typically indicate 
relationships like hierarchical structure and navigation paths. The 
Link field is semantically equivalent to the <LINK> element in
HTML.[5]

Link           = "Link" ":" #("<" URI ">" *( ";" link-param )

link-param     = ( ( "rel" "=" relationship )
                   | ( "rev" "=" relationship )
                   | ( "title" "=" quoted-string )
                   | ( "anchor" "=" <"> URI <"> )
                   | ( link-extension ) )

link-extension = token [ "=" ( token | quoted-string ) ]

relationship   = sgml-name
               | ( <"> sgml-name *( SP sgml-name) <"> )

sgml-name      = ALPHA *( ALPHA | DIGIT | "." | "-" )

Relationship values are case-insensitive and MAY be extended within
the constraints of the sgml-name syntax. The title parameter MAY be 
used to label the destination of a link such that it can be used as 
identification within a human-readable menu. The anchor parameter MAY 
be used to indicate a source anchor other than the entire current 
resource, such as a fragment of this resource or a third resource. 

Examples of usage include: 

Link: <http://www.cern.ch/TheBook/chapter2>; rel="Previous" 

Link: <mailto:timbl@w3.org>; rev="Made"; title="Tim Berners-Lee" 

The first example indicates that chapter2 is previous to this 
resource in a logical navigation path. The second indicates that the 
person responsible for making the resource available is identified by 
the given e-mail address. 
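
A receiver could extract the URI and parameters from one Link value roughly as follows. This Python sketch is illustrative and deliberately simplified: the function name is hypothetical, it handles a single "<URI>; param; param" value with the "Link:" field name already removed, it does not split comma-separated lists, and its parameter-name pattern is narrower than the token rule.

```python
import re

_LINK_VALUE = re.compile(r'\s*<([^>]*)>(.*)$', re.S)
# name, then optionally "=" followed by a quoted-string or a bare token.
_PARAM = re.compile(r';\s*([A-Za-z0-9.-]+)\s*(?:=\s*(?:"([^"]*)"|([^;\s]+)))?')


def parse_link_value(value: str) -> tuple:
    """Return (uri, params) for a single Link value."""
    match = _LINK_VALUE.match(value)
    if match is None:
        raise ValueError("Link value does not start with <URI>: %r" % value)
    uri, rest = match.groups()
    params = {}
    for name, quoted, token in _PARAM.findall(rest):
        name = name.lower()
        text = quoted if quoted else token
        # Relationship values are case-insensitive, so normalize them.
        if name in ("rel", "rev"):
            text = text.lower()
        params[name] = text
    return uri, params
```

Applied to the first example above, the sketch yields the chapter2 URI with the parameter rel set to "previous".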

19.6.2.5 URI 

The URI header field has, in past versions of this specification, 
been used as a combination of the existing Location, Content- 
Location, and Vary header fields as well as the future Alternates 
field (above). Its primary purpose has been to include a list of
additional URIs for the resource, including names and mirror 
locations. However, it has become clear that the combination of many 
different functions within this single field has been a barrier to 
consistently and correctly implementing any of those functions. 
Furthermore, we believe that the identification of names and mirror 
locations would be better performed via the Link header field. The 
URI header field is therefore deprecated in favor of those other 
fields. 

URI-header = "URI" ":" 1#( URI ) 

19.7 Compatibility with Previous Versions 

It is beyond the scope of a protocol specification to mandate 
compliance with previous versions. HTTP/1.1 was deliberately
designed, however, to make supporting previous versions easy. It is 
worth noting that at the time of composing this specification, we 
would expect commercial HTTP/1.1 servers to: 

o recognize the format of the Request-Line for HTTP/0.9, 1.0, and 1.1
requests;

o understand any valid request in the format of HTTP/0.9, 1.0, or
1.1;

o respond appropriately with a message in the same major version used 
by the client. 

And we would expect HTTP/1.1 clients to: 

o recognize the format of the Status-Line for HTTP/1.0 and 1.1 
responses; 

o understand any valid response in the format of HTTP/0.9, 1.0, or 
1.1.

For most implementations of HTTP/1.0, each connection is established 
by the client prior to the request and closed by the server after 
sending the response. A few implementations implement the Keep-Alive 
version of persistent connections described in section 19.7.1.1.

19.7.1 Compatibility with HTTP/1.0 Persistent Connections

Some clients and servers may wish to be compatible with some previous 
implementations of persistent connections in HTTP/1.0 clients and 
servers. Persistent connections in HTTP/1.0 must be explicitly 
negotiated as they are not the default behavior. HTTP/1.0 
experimental implementations of persistent connections are faulty, 
and the new facilities in HTTP/1.1 are designed to rectify these 
problems. The problem was that some existing 1.0 clients may be 
sending Keep-Alive to a proxy server that doesn't understand 
Connection, which would then erroneously forward it to the next 
inbound server, which would establish the Keep-Alive connection and 
result in a hung HTTP/1.0 proxy waiting for the close on the 
response. The result is that HTTP/1.0 clients must be prevented from 
using Keep-Alive when talking to proxies. 

However, talking to proxies is the most important use of persistent 
connections, so that prohibition is clearly unacceptable. Therefore, 
we need some other mechanism for indicating a persistent connection 
is desired, which is safe to use even when talking to an old proxy 
that ignores Connection. Persistent connections are the default for 
HTTP/1.1 messages; we introduce a new keyword (Connection: close) for 
declaring non-persistence. 

The following describes the original HTTP/1.0 form of persistent
connections. 

When it connects to an origin server, an HTTP client MAY send the 
Keep-Alive connection-token in addition to the Persist connection-token:

Connection: Keep-Alive 

An HTTP/1.0 server would then respond with the Keep-Alive connection 
token and the client may proceed with an HTTP/1.0 (or Keep-Alive) 
persistent connection. 

An HTTP/1.1 server may also establish persistent connections with 
HTTP/1.0 clients upon receipt of a Keep-Alive connection token. 
However, a persistent connection with an HTTP/1.0 client cannot make 
use of the chunked transfer-coding, and therefore MUST use a 
Content-Length for marking the ending boundary of each message. 

A client MUST NOT send the Keep-Alive connection token to a proxy 
server as HTTP/1.0 proxy servers do not obey the rules of HTTP/1.1 
for parsing the Connection header field. 
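
The persistence rules described in this section might be applied by an origin server as in the following Python sketch. It is illustrative only: the function name is hypothetical, header names are assumed to be already lower-cased, and the caller is assumed to know whether its response will carry a Content-Length.

```python
from typing import Mapping


def server_keeps_connection_open(
    http_version: str,
    headers: Mapping[str, str],
    response_has_content_length: bool,
) -> bool:
    """Decide whether to keep the connection open after the response."""
    tokens = {
        token.strip().lower()
        for token in headers.get("connection", "").split(",")
        if token.strip()
    }
    if http_version == "HTTP/1.1":
        # Persistent by default; "close" declares non-persistence.
        return "close" not in tokens
    if http_version == "HTTP/1.0":
        # Must be negotiated explicitly, and the message boundary must be
        # marked with Content-Length because chunked coding is unavailable.
        return "keep-alive" in tokens and response_has_content_length
    return False
```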



Fielding, et. al.            Standards Track                  [Page 161]

RFC 2068                        HTTP/1.1                    January 1997



19.7.1.1 The Keep-Alive Header

When the Keep-Alive connection-token has been transmitted with a
request or a response, a Keep-Alive header field MAY also be 
included. The Keep-Alive header field takes the following form: 

Keep-Alive-header = "Keep-Alive" ":" 0# keepalive-param

keepalive-param = param-name "=" value

The Keep-Alive header itself is optional, and is used only if a 
parameter is being sent. HTTP/1.1 does not define any parameters. 

If the Keep-Alive header is sent, the corresponding connection token 
MUST be transmitted. The Keep-Alive header MUST be ignored if 
received without the connection token.

TITLE: NETWORK BASED CLASSIFIED INFORMATION SYSTEMS
FIELD OF INVENTION 

This invention relates to network based classified information systems, to methods of 
automatically building searchable databases of classified information derived from web pages
posted on a network, and, to web pages for use in such systems and methods. 

The information systems and databases of most relevance to this invention are those which 
include classified product and service catalogues similar to the Yellow Pages telephone books, 
contact indexes similar to the White Pages telephone books, and/or subject indexes similar to
Library catalogues. Such information systems and databases typically include sets of
associated classification, contact and/or geographic items of information. For convenience,
classification, contact and/or geographic information will be hereinafter called CCG-data.

The networks with which this invention is concerned are the worldwide public
computer/communications network commonly known as the Internet and private networks -
sometimes called intranets - which allow common access to markup documents on computers
connected to the network. Markup documents are text files prepared using various markup
languages such as Hypertext Markup Language (HTML) and Extensible Markup Language
(XML) which are implementations (or dialects) of the Standard Generalised Markup Language
(SGML). The system of accessible files on the Internet is called the World Wide Web (WWW)
and the markup documents themselves are commonly called 'web pages'. A web page is said
to be 'posted' on a network when it is stored on computer-readable media of a host network
computer as a file which is generally accessible to network users. A web page is transported
from the host computer to a requesting computer through intermediate network computers as
a computer-readable signal embodied in a carrier wave. Though this invention is not limited to
Internet based information systems, these terms are used for convenience.

BACKGROUND TO THE INVENTION 

It has been estimated that there are about 100 million web pages on the Internet and that the
number is doubling every two years. Many of these pages include information concerning 
commercially offered goods and services and often include contact details. But the difficulty of 
locating such information is increasing faster than the growth in the number of web pages. 

To assist network users to locate web pages of interest, certain network service providers create
indexes (or databases) of the contents of web pages posted (stored on computer readable
media so as to be generally accessible) on the network and provide 'search engines' to use
the indexes. These indexes are often created automatically by the use of 'web crawlers' which
(i) interrogate computer after computer on the network to locate successive web pages and (ii)
index the words in each web page encountered against the network address (eg Internet
Protocol Address or IPA) and filing system path or universal resource locator (URL) at which
the web page is accessible. Hereinafter the terms URL and URI (Uniform Resource Identifier)
are taken to be identical in meaning and to signify network addresses and filing system paths.
Usually, the indexes consist of a list of unique words with each word having an associated list
of URLs of the web pages wherein the word was found to occur during interrogation. The URL
serves as a 'hyperlink' which, if selected by a user/searcher, results in the associated web
page being automatically transmitted from the computer where it is posted on the network to
the user/searcher's computer where it may be displayed or otherwise processed. The sending
and receiving of files in this way is greatly assisted by user interface programs called 'web
browsers' (or more simply, 'browsers') such as Netscape and Microsoft Internet Explorer.
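
By way of illustration only, an index of the kind just described (a list of unique words, each with an associated list of URLs) might be sketched in Python as follows. The function names, the simple word-splitting rule and the example URLs used in any test are assumptions made for the sketch and are not drawn from any particular search engine.

```python
import re
from collections import defaultdict


def build_index(pages: dict) -> dict:
    """Map each unique word to the set of URLs of pages containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        # Assumed tokenisation: lower-case runs of letters and digits.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return dict(index)


def search(index: dict, *keywords: str) -> set:
    """Return the URLs of pages containing every keyword."""
    hits = [index.get(keyword.lower(), set()) for keyword in keywords]
    return set.intersection(*hits) if hits else set()
```

Because such an index records only which words occur on which pages, a search on a common word returns every page containing it, regardless of the sense in which the page author used the word.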

The search for web pages of interest using search engines leaves much to be desired:

• simple searches (those using a few keywords in simple combinations) often yield far too
many web page references (URLs) to permit them to be interrogated one-by-one,

• complex searches (those using many keywords and/or complex Boolean expressions)
require considerable expertise to undertake,

• even using optimum search criteria, many irrelevant web pages are referenced because of
inconsistent use of terminology by those who author the original web pages,

• even using optimum search criteria, many relevant pages are missed, again because of
inconsistent use of terminology by web page authors, and

• items of information included in the body of web pages cannot be 'understood' or
associated in useful ways by web crawlers; that is, an item cannot be recognised as, say, a
surname, a street name, a geographic locality, or a type of goods or services, nor can, say, a
surname be strongly associated with a street name, a geographic locality, or a type of goods
or service.

The result is that information provided by search engines from databases which are
automatically compiled using web crawlers is a very poor equivalent of the common Yellow 
Pages and White Pages directories which serve the telephone industry (though these 
directories are not, of course, automatically compiled from web pages). 

In an attempt to improve the usefulness of automatically compiled network databases, some
search engine providers make use of information contained in URLs, such as the country code
and top level domain name codes such as 'com', 'edu', 'net' and 'org' which are sometimes used
to signify the subject matter of web pages. It has been proposed to add more content
classifying codes to URLs (eg, "chem" to signify chemical subject matter) to allow specialised
databases - national, commercial, chemical, etc - to be generated. However, this proposal
has serious drawbacks:

• URLs are Internet addresses and it is in principle undesirable to confuse the address
function of a URL with that of representing a list of web page classifications or contact
details.

• A URL is an inappropriate container of multiple web page classification codes and contact
details because the length of the URL would cause it to become unwieldy as an Internet
address.

• Including in a URL classification codes drawn from a list of thousands of codes would
compromise the mnemonic quality of Internet addresses such as "www.yellowpages.com".

• There is substantial overlap in the subject matter contained in web pages having the
various top level domain name codes.

• There is no consensus on, or standard for, content classification codes in URLs.

Another proposal to add content classification data to web pages has arisen from the wish to
identify pages containing material that may be offensive to some viewers, or should not be
accessed by minors. The Platform for Internet Content Selection (PICS) (see
http://www.w3.org/pub/WWW/PICS and other documents at www.w3.org) is a web page
ratings standard similar in principle to the ratings systems for motion pictures. This system
allows page authors to 'internally' self classify their pages through use of the "<meta...>"
HTML element. Alternatively, "external" PICS ratings of web pages may be obtained from
ratings service providers accessed each time a URL is selected. In practice, the ratings service
providers have adopted a very limited range of web page classifications. For example, Ararat
Software's Commercial Rating System (see http://www.ararat.com/ratings/ararat10.html)
provides just 5 categories of web page content: commercial content, technical/customer
support, ordering information, downloading information and contact information. In other
examples, CyberPatrol (http://www.microsys.com/pic^^ provides 16 categories,
the Recreational Software Advisory Council (http://www.rsac.org/faq.html) provides 4
categories, SafeSurf (http://www.safesurf.com/ssplan.htm) provides 11 categories and
Vancouver Webpages Rating Service (http://vancouver-webpages.com/VWP1.0/) provides 11
categories. None of the categories provide classification of web pages by industry, service,
product or subject with sufficient specificity to be useful when searching for web pages.
Rather the categories are intended to prevent web browsers from displaying web pages
unsuitable for particular types of web browser users. Such rating systems are not intended to
be used for the automated creation of Yellow or White pages like databases from web pages
and are unsuitable for that purpose because they cannot represent contact details. Further,
the ratings data may only be encoded in the <meta...> element in the <head> of an HTML
document, drastically limiting the type and usefulness of the data that can be encoded.

Another proposal for classifying the content of web pages, the 'Meta Content Framework'
(MCF - see http://mcf.research.apple.com/mcf.html), requires the content of web pages to be
classified and the classification data to be held in a separate non-HTML data file with a MIME
type of text/mcf. Storing data in non-HTML encoded documents which describes the content of
HTML encoded documents is a technical and economic barrier to the adoption by search
engine providers of the proposal. The MCF proposal is thus entirely unsuited to the automated
creation of Yellow or White pages like databases from HTML encoded web pages (MIME type
text/html) because data stored according to the MCF proposal is not stored in HTML encoded
web pages.

The 'Electronic Business Card', vCard (see "vCard The Electronic Business Card" Version
2.1, versit Consortium Specification, Sept 18, 1996 or ftp://ds.internic.net/internet-drafts/draft-
ietf-asid-mime-vcard-01.txt), uses a non-HTML data file (MIME Content Types of "text/plain" or
the non-standard "text/X-vCard") containing contact information equivalent to an extended
White Pages entry which can be exchanged on a network using Simple Mail Transfer Protocol
(SMTP) or using HTTP. It can be associated with a web page by use of a URL in the web page
which refers to the vCard information (eg <a href="http://www.thing.com/vCard.vcf">My
vCard</a>). Version 2.1 vCard standard data file format (published 18 September 1996)
provides for the inclusion of many items of contact information. The vCard specification
recommends that where possible, there should be consistent mapping of vCard property
names to HTML "<input>" element attribute names (eg vCard property name "TITLE" maps to
HTML "<input name="title">"). The intention is to facilitate the transfer of vCard data into web
page input forms by pasting from a clipboard or by dragging from other computer applications.
The vCard proposal is unsuited to the automated creation of Yellow or White pages like
databases from HTML encoded web pages because data stored according to the vCard
proposal is not stored in HTML encoded web pages.

The inclusion of classified information in separate documents (such as Meta Content files or
vCards) has the disadvantage that there is necessarily much duplication of data and
coordination of modifications between the separate documents and the web pages. This must
be done to allow a person who has accessed a web page using an HTML compliant browser
to determine whether it is worth calling up the associated file or vice versa. Also, to allow
portions of web pages to be classified, web page contextual information would have to be
duplicated in the separate document. vCards in particular do not provide this functionality.
Another disadvantage is that non-HTML documents such as vCards contain no details as to
how the data they contain is to be displayed. In the display of HTML documents the position,
font size, colour of the text and other elements of the document are of great importance. The
restriction of address data in a vCard to untagged ordinally organised fields is inflexible. For
example, multiple instances of extended parts of the address are not possible. Also
components of names, addresses and telephone numbers and so forth are insufficiently
identified.

The Online Computer Library Center Inc (OCLC, Dublin, Ohio, USA) proposal, known as the
"Dublin Core", proposes classifying scholarly web pages by subject (topic of the work, or
keywords that describe the content of the work), title, author, publisher, other agent, date,
object type (genre of the object such as home page, novel, poem etc), form, identifier, source,
language, relationship and coverage (spatial and temporal) (see
http://www.oclc.org:5046/~weibel/html-meta.html and other documents at www.oclc.org). This
proposal does not include industry, service, product or subject classifications. It also does not
include contact details. Names such as that of the author are not specified in sufficient detail to
avoid ambiguities such as which are the author's first and last names. The proposal specifies
that the details are encoded using the <meta...> element in the <head> of web pages. The
proposal is unsuited to the automated creation of Yellow or White Pages-like databases from
web pages because the proposal does not provide for classification of web pages and does
not provide adequate contact details. Further, the use of keywords for describing the content
of the work adds very little to the effectiveness of indexing of web pages since the web pages
are usually indexed on every word of their content and most often the keywords would simply
be a duplication of words already contained in the document.

It has also been proposed to use the Dewey Decimal System (see
http://orc.rsch.oclc.org:6109/eval_dc.html and http://orc.rsch.oclc.org:6109/bintro.html) to rank
electronic documents against a Dewey Decimal subject classification. The proposal suggests
automatically assigning Dewey Decimal subject classification codes to documents during
automated indexing and cataloguing but does not specify the exact nature of the assignment,
although it is implied that the codes are stored separately from the documents. The proposal
admits that such automated classification is less satisfactory than human classification. The
proposal is unsuited to the automated creation of Yellow or White Pages-like databases from
web pages because the accuracy of classification is inadequate, and because it does not
provide for inclusion of industry, service or product classifications or of contact
details. Deriving a subject classification code from an analysis of every word and phrase in a
web page is also computationally expensive.

The HTML 3.0 standard (see page 23 of the www.w3.org document
"draft-ietf-html-specv3-00.txt") provides "class" as an attribute of almost all HTML "<body>"
elements. The "class" attribute is intended to be used with style sheets. Style sheets provide
a means by which the display of HTML documents may be altered to suit the needs of different
classes of browser users. For example, <div class="appendix"> could be used to define a
division that acts as an appendix, and <h2 class="section"> could be used to define a level 2
header that acts as a section header, although, of course, any string of characters could be
defined for those purposes. The "class" attribute has never been suggested for holding goods
and services classifications and is not suited for such a use, as it is, in any case, undesirable
to confuse the style sheet function of the "class" attribute.

The HTML 3.0 and earlier standards provided the HTML elements "<person>" and "<address>"
but do not specify the form of the content or method of validating the content of those
elements. A person's name may be written as first name followed by last name or last name
followed by first name. Similarly, different conventions exist for writing addresses. Similar
ambiguities arise in the ill defined format of the HTML elements "<person>" and "<address>".
As such they are of little use in the automatic compilation of searchable databases.

The XML language (see: http://www.textuality.com/sgml-erb/WD-xml.html) was developed to extend
HTML so that software vendors can add new elements and new element attributes to HTML
which are not specifically defined in any HTML standard. The intention is to ensure that all new
elements and attributes could be parsed by all XML parsers even if the new elements held no
significance for any particular XML parser. However, like HTML, XML does not provide a
standard for the representation of industry, service, product or subject classification, contact or
geographic location details within a web page.

Of course, many useful databases of the Yellow Pages or White Pages type are made
available by service providers on networks, but they are not compiled automatically by using
web crawlers to scan HTML web pages posted on a network. For example,
http://www.yellowpages.com.au and http://www.mcp.com provide classified advertisements of
the Yellow Pages type with links to the web pages of paying advertisers or subscribers. There
are also directories of email addresses which approximate the White Pages directories, listing
the names of individuals and organisations and contact details (eg http://www.bigbook.com
and http://query1.whowhere.com). However, these email directories require listers to manually
add their directory entries and enquirers to be aware of and to find the directory enquiry web
page. They cannot be automatically generated by scanning web pages using web crawlers
since there is no adequate mechanism to relate email addresses to the names of people and
organisations and their other contact details which may also exist in the same web page.

OBJECTIVES OF THE INVENTION

The general object of the invention is to provide improved methods for automatically building 
searchable databases of classification, contact, and/or geographical information by using web 
crawlers to interrogate web pages posted on a network. [For convenience, this information is 
collectively referred to as CCG-data]. 

Other non-essential objectives are to provide methods for including and/or displaying
CCG-data within web pages accessed by browsers, for automatically extracting CCG-data from web
pages posted on a network and for using the same, and/or to provide methods for searching
automatically compiled databases using such data.

Another subsidiary objective of the invention is to provide a new form of web page which is 
better suited to the automatic compilation (using web crawlers) of databases constructed by 
the automatic scanning of many such pages posted on a network. 

OUTLINE OF THE INVENTION

The invention is based upon the realisation that highly useful databases can be automatically
built by successively interrogating web pages posted on a network if one or more HTML
encoded CCG phrases are included in the web pages. A CCG phrase is one containing
CCG-data in a form which is directly accessible and identifiable. CCG phrases may also include one
or more items which provide the web page author with control over how the CCG-data is
applied to the database.

Data duplication can be reduced if some of the CCG-data in the coded CCG phrases can be
displayed by browsers as well as being used to update databases. Errors due to inexactly
duplicated data are also eliminated. Accordingly, it is envisaged that CCG phrases may include
one or more items which provide the web page author with control over how the CCG-data is
displayed by a browser.

HTML (including version 2 and version 3) and XML are evolving applications (sub-sets or
dialects) of ISO Standard 8879:1986 known as Standard Generalised Markup Language
(SGML). HTML, in large part, is a language used to describe how text (unstructured data) and
graphics are to be formatted for display. The HTML language consists of a finite number of
"elements" (for example, "<BR>" where "BR" is the element name, also called the tag name)
which may contain "attributes" (for example, "<DL COMPACT>" where "COMPACT" is an
attribute named "COMPACT") and may contain values associated with attributes (for example,
"<FONT SIZE=+1>" where +1 is the attribute value of the attribute named "SIZE"). XML is a
language used to describe structured data. The XML language is similarly composed of
elements, attributes and values with a similar syntax to HTML but, unlike HTML, the element
names which may be used are not restricted and the meaning of the XML data may be
interpreted in any convenient manner. While the XML language is mute about how data
described by XML is to be formatted for display, the data may be used by computer programs
for any purpose including description of how XML coded data is displayed. However, due to its
historic importance in connection with web pages, the term "HTML" is herein used to refer to all
markup languages which are subsets or complete sets of the SGML language. In particular,
the term "HTML encoded CCG phrase" and the synonymous term "CCG phrase" are herein
used to refer to CCG-data encoded in a subset or complete set of the SGML language.
Herein, a "web page" is a document adapted to be or actually accessible through a network
and encoded in a subset or complete set of the SGML language.

For convenience, CCG items in HTML encoded CCG phrases, whether they are syntactically
represented as elements or as attributes, will be referred to hereinafter as CCG attributes.

A CCG phrase includes at least one of the following identifiable types of CCG-data attributes:

• industry, product, service, and/or subject classifications,

• contact categories, contact person(s) and/or organisation(s) names, titles or
associations, contact details including physical and postal addresses, telephone and
fax numbers, email and Internet or network addresses or locations, public keys, and

• geographic location details.

A CCG phrase may also include any of the following identifiable types of CCG control
attributes:

• database control attributes to indicate which parts of the data are to be used to
update databases, and

• display control attributes to indicate how browsers are to display the data.

By virtue of occurring in the same CCG phrase, a plurality of CCG-data attributes are 
associated with each other. 

By virtue of their occurrence in the same CCG phrase, CCG-data attributes are identified as
a set of associated attributes. However, the degree of association between attributes can be
controlled by the inclusion in the phrase of database control attributes.

The start and end of CCG phrases should be identifiable to clearly distinguish these phrases
from other data. To identify the beginning and end of a CCG phrase, at least one HTML
element should have a CCG specific HTML element name or CCG specific attribute name or
CCG specific value. Each CCG attribute may consist, with or without other incidental
characters, of a CCG attribute name and/or a CCG value or values. Preferably, each CCG
phrase is contained in the "<body>" of the web page.

Examples of a CCG specific HTML element are: "<CCG ...>" or "<CCG ... />" or
"<CCG>...</CCG>". (Where a CCG phrase is coded in XML, the elements "<XML>" and
"</XML>" may also be needed at the start and end of the CCG phrase.) A less satisfactory
example is: "<!--CCG ...-->" where the characters "CCG" after the HTML comment element name
"!--" are used to signify that the comment contains CCG-data. An example of the use of a CCG
specific attribute name is: "<START CCG>"..."<END CCG>". An example of the use of a CCG
specific value is: "<START TYPE="CCG">"..."<END TYPE="CCG">". Obviously, other
character strings could be substituted for the element name, element attribute name or
element attribute value "CCG" string of the examples.

The codes "<CCG ...>" and "<CCG .../>" are compatible with most HTML specifications but,
because they are non-standard HTML, most web browsers do not display any text or attributes (eg
PQ="AQD") within the angle brackets "<" and ">". These codes are preferred where display of
the CCG data is not required and compatibility with older browsers is required (eg CCG
phrases containing only classification values).

From one aspect, therefore, the invention comprises a web page for posting on a network, the
web page being characterised by the inclusion of at least one CCG phrase in the "<body>" of
the page, the CCG phrase being such that the CCG attributes contained therein are
accessible and identifiable by (i) HTML compliant editors and/or (ii) HTML compliant web
crawlers for the automatic construction of databases of classified information, and/or (iii) HTML
compliant browsers for display on the computer screens of network users.

From another aspect, the invention comprises a method of constructing web pages of the
above described type. The web pages may be constructed on digital computers using simple
text editors such as Microsoft Windows Notepad or, preferably, purpose built human controlled
editors or automated composing programs which embody knowledge of HTML and CCG
syntax and grammar. Whichever process is used, CCG attributes are selected and inserted,
modified, deleted and/or organised to form valid CCG phrases in HTML encoded documents
and the documents are posted on computer readable storage devices of computers connected
to a computer network so that the documents are generally available to computers on the
network.

From another aspect, the invention comprises a method of populating a database with
CCG-data extracted from web pages. Web pages posted on a network are successively retrieved by
a digital computer program (eg a web crawler) and CCG phrases contained therein are
identified and at least some of the CCG attributes found within the CCG phrases are extracted.
The CCG attribute names are used to determine the type of data in the associated values.
Generally the CCG attributes of interest are those relating to classification, contact and
geographic data and database update controls while the attributes of little or no interest in
relation to database updating are those relating to display controls. Of course, the CCG-data
extracted need only be that relevant to the particular database being updated. For example,
one database may have been designed to index only web page classifications and URLs while
another database may have been designed to index only contact details. Databases also differ
in their internal representation of data and means of associating data. For example, some use
"flat file" tables, others use pointers to data to create network associations while others use
hashing and buckets.
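
The extraction step described above can be sketched in Python using the standard library HTML parser. This is a minimal illustration only, assuming CCG phrases are written in the "<CCG ...>" element form; the class name CCGExtractor, the function extract_ccg and the sample page are hypothetical, and the set of display control names is taken from Example 1 of this description.

```python
from html.parser import HTMLParser

# Display control names (from Example 1) that a database updater can ignore.
DISPLAY_CONTROLS = {"show", "hide", "xpos", "ypos", "newline", "align", "size",
                    "color", "face", "blink", "bold", "underline", "italic",
                    "strike", "subscript", "superscript", "clear", "normal"}


class CCGExtractor(HTMLParser):
    """Collects the attributes of every <CCG ...> element in a page."""

    def __init__(self):
        super().__init__()
        self.phrases = []  # one list of (name, value) pairs per CCG phrase

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling us.
        if tag != "ccg":
            return
        kept = [(name.upper(), value) for name, value in attrs
                if name not in DISPLAY_CONTROLS]
        self.phrases.append(kept)


def extract_ccg(page_html):
    """Return the non-display CCG attributes found in page_html."""
    parser = CCGExtractor()
    parser.feed(page_html)
    parser.close()
    return parser.phrases


# Hypothetical page containing a single CCG phrase.
page = '<body><p>Hi</p><CCG SHOW SIZE="+1" CN="ANZSIC" CC="61;Road transport"></body>'
print(extract_ccg(page))
```

A real crawler would also record the URL of each page and pass it, with the extracted attributes, to the database update process.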

The conventional nomenclature differs considerably between different types of database.
Depending on the particular database nomenclature, data of the same type is said to be stored
in table columns, fields, attributes and properties. The terms column and field are somewhat
related to the physical representation of the data in files while attribute and property are more
related to the logical representation of data. To avoid confusion with the terms "HTML
attribute", "CCG attribute" or just "attribute", hereinafter a database property means both a type
of data stored in the database and a place in the database where data of the same type is
stored. Database properties are referred to by a name ("property name") or similar reference
and contain values. For example, a database property with the name "City name" and which
contains values which are all the names of cities may be defined as a "City name" type
database property.

Whichever style of database is used, it is preferred that the database update program relate
the CCG attributes to corresponding database properties used by the database update
process so that the database property values are updated with CCG values in a manner which
preserves the distinctness, content and meaning of the CCG values and, preferably, preserves
the CCG value associations expressed in the CCG phrase as sets of associated database
property values of different types.

In some cases, it is desired to know the address of the web page from which the CCG values
were extracted. For example, the purpose of building a database might be to allow searching
of the database by web page classification to provide a list of URLs of web pages or URLs of
portions of web pages which contain matching CCG classifications. The URLs could then be
inserted in an HTML document and transmitted to a web browser as a list of references to web
pages matching a search expression. In that example, associating the URL of a web page or
the URL of a portion of a web page with the CCG values extracted from the same web page or
web page portion is important and the URL or means of reconstructing it must be available and
supplied to the database update process. In one style of database, the values of the same
type are held in separate rows in a column (property) of a database table, and pointers held in
another column (property) are associated with the values by sharing the same table row. The
table row constitutes a set of associated property values. Each pointer points to a bucket
(block of data) containing a list of URLs or pointers to URLs held in a separate bucket or table.
In another style of database, values of different types are held in different tables together with
a set number, pointer or similar code which is used to indicate which values are associated as
members of the same set. In one variation, the values of set members are prefixed with a code
indicating the type of value and all values are held in the same column of a table. If the
purpose of the database is to hold contact data, recording the web page URL in the database
might not be required although, if the URL is not present in the database, updating changes in
the CCG contact details contained within a web page is more difficult. Of course, one
database may be used to record all types of CCG values contained in web pages and
associate with each other any and all values extracted from the same web page or even from
other web pages.
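
As a rough illustration of the second style of database described above, the following Python sketch keeps each set of associated property values under a set number and records the source URL with the set. It is an in-memory toy under stated assumptions, not a prescribed design; the class CCGDatabase, its method names and the example URLs are all hypothetical.

```python
from collections import defaultdict


class CCGDatabase:
    """Toy in-memory store: each set of associated values gets a set number."""

    def __init__(self):
        self.sets = []                 # set number -> {"url": ..., "values": {...}}
        self.index = defaultdict(set)  # (property, value) -> set numbers

    def add_set(self, url, values):
        """Store one set of associated property values extracted from url."""
        set_no = len(self.sets)
        self.sets.append({"url": url, "values": dict(values)})
        for prop, value in values.items():
            self.index[(prop, value)].add(set_no)
        return set_no

    def urls_for(self, prop, value):
        """Return the sorted URLs of all sets holding value under prop."""
        set_numbers = self.index.get((prop, value), ())
        return sorted({self.sets[n]["url"] for n in set_numbers})


db = CCGDatabase()
db.add_set("http://example.com/a.html#depot", {"CN": "ANZSIC", "CC": "61", "ACN": "Sydney"})
db.add_set("http://example.com/b.html", {"CN": "ANZSIC", "CC": "61", "ACN": "Perth"})
print(db.urls_for("CC", "61"))
print(db.urls_for("ACN", "Perth"))
```

Keeping the URL with each set is what later allows changed contact details on a page to be found and updated.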

From another aspect, the invention comprises a method of searching the databases
constructed as outlined above. These databases may be used for a variety of searching
purposes. For example, to find web page URLs by using the association of web page URLs
with industry, service, product or subject classification or a person's or organisation's name or
address or geographic location values or any combination thereof. In another example, the
databases may be used to find the contact details for people or organisations by name or
location or industry, service, product or web page subject type and so forth by using the
association between items of the contact details in the database without having to retrieve web
pages associated with the contact details.

More particularly, the searching method involves finding URL references, or finding sets of
associated database property values, from databases containing CCG-data. The method
includes the steps of parsing a query phrase received from a computer network to extract query
relational expressions and, from each expression, deriving a query field name, query relational
operator and query value, determining the type of the query field by reference to its name,
relating the query field to a corresponding database property according to type and locating
CCG-data database property values in the database property which return a true value when
tested against the query value using the query relational operator. Finally, the URL references
or the sets of property values associated with the so located CCG-data database property
values are extracted.
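
The searching steps above can be sketched as follows. This Python fragment is illustrative only: the query syntax ("field operator value"), the FIELD_TO_PROPERTY mapping, the ROWS table and the function names are assumptions made for the sketch, and all comparisons are plain string comparisons.

```python
import operator
import re

# Relational operators supported by this sketch.
OPERATORS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt, ">": operator.gt}

# Hypothetical mapping from query field names to database property names.
FIELD_TO_PROPERTY = {"city": "ACN", "class": "CC", "family": "PNF"}

# Hypothetical database rows; each row is one set of associated property values.
ROWS = [
    {"url": "http://example.com/a.html", "ACN": "Sydney", "CC": "61"},
    {"url": "http://example.com/b.html", "ACN": "Perth", "CC": "61"},
]


def parse_expression(text):
    """Split one relational expression into (field name, operator, value)."""
    match = re.fullmatch(r"\s*(\w+)\s*(!=|=|<|>)\s*(.+?)\s*", text)
    if match is None:
        raise ValueError(f"not a relational expression: {text!r}")
    return match.groups()


def search(query):
    """Return the URLs of rows whose property value satisfies the query."""
    field, op_symbol, value = parse_expression(query)
    prop = FIELD_TO_PROPERTY[field.lower()]  # relate query field to property
    test = OPERATORS[op_symbol]
    return [row["url"] for row in ROWS if prop in row and test(row[prop], value)]


print(search("city = Sydney"))
print(search("class=61"))
```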

Database queries are usually expressed in a query language in the form of a phrase or
sentence. In query by example style enquiry systems, the user types values into input fields on
a form and a program extracts the input values and uses the values to automatically compose
a query phrase or sentence. There are many existing examples of query languages used in
connection with databases. Generally, they consist of relational expressions (eg Field=Value),
logical expressions and grouping of relational and logical expressions by means such as
parentheses. They may also contain sorting and output formatting expressions. Often
abbreviated notation is used in the expressions such as leaving out field names or relational
operators which are then inferred from the value in the expression or implied by default. In an
enquiry the nature and format of the output may also be implied, such as a list of URLs of web
pages or a list of contact details. Whatever the mechanism of any particular database, the
query expression needs to be parsed and fields in the query expression, whether explicit, default,
implied or inferred, need to be related to database properties of similar type. In some styles of
database enquiry the query expression is evaluated against each row of a table or record of a
file to find rows or records (ie a set of associated property values) which match the query
expression. In other styles, sub-sets of the values of the properties are selected according to
the interpretation of relational expressions in the query expression and the sub-sets are
combined according to logical and grouping expressions in the query to find the sets of
associated property values which match the query expression. Often, to make logical
operations which combine the selected sub-sets more efficient, it is not the values which are
selected but pointers to the values (eg table name and table row) or unique keys (eg URLs or
pointers to URLs) associated with the values. For example, the AND logical operator is often
used to combine two lists so that only values or pointers or keys common to both lists are
found in the combined list. Usually, the query produces a result list which is then provided to
other processes. For example, a list of URLs of web pages is processed to produce an
attractively formatted HTML encoded document containing the URLs and is sent to a web
browser to allow an enquirer to retrieve interesting web pages. In another example, the contact
details associated in the database with each value or pointer in the result list are retrieved from
the database and presented as a report in the form of an HTML encoded document which is
sent to a web browser for viewing.
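
The combination of selected sub-sets by logical operators can be illustrated with Python sets of keys. This is a small sketch only, assuming URLs are used as the unique keys; the function name combine and the example URLs are hypothetical.

```python
def combine(op, left, right):
    """Combine two sets of keys (eg URLs) under a logical operator."""
    if op == "AND":
        return left & right  # keys common to both lists
    if op == "OR":
        return left | right  # keys found in either list
    raise ValueError(f"unsupported logical operator: {op!r}")


# Keys selected by two separate relational expressions (hypothetical data).
road_transport = {"http://example.com/a.html", "http://example.com/b.html"}
in_sydney = {"http://example.com/a.html", "http://example.com/c.html"}

print(sorted(combine("AND", road_transport, in_sydney)))
```

Working on keys rather than on the values themselves keeps each intermediate list small, which is the efficiency point made above.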

From another aspect, the invention comprises a method of displaying CCG-data contained in
CCG phrases within web pages which are displayed by a web browser executing on a digital
computer. While a web page is loading or has loaded in a web browser, the web browser
parses the web page and displays the text (or data) of the web page on a display device
connected to the computer. When the web browser parser encounters CCG phrases, the web
browser may display the CCG-data (element and/or attribute names (or translations of element
and/or attribute names) and/or values) in a number of browser specific ways. For example, the
web browser may by default not display any CCG-data, display all CCG-data, not display any
CCG-data until a CCG display control attribute explicitly states that subsequent data should be
displayed or display all CCG-data until a CCG display control attribute explicitly states that
subsequent data should not be displayed. The web browser may also use CCG display
controls specifying the size, font, position and so forth to alter the display of the CCG-data.



DESCRIPTION OF EXAMPLES 

Having indicated the nature of the present invention, examples or embodiments thereof will 
now be described by way of illustration only. 


Example 1: HTML Syntax Suitable for Representing a CCG Phrase 

The following is an example of HTML element syntax suitable for representing CCG phrases in
which a control (e.g. "SHOW") may be "good until countermanded" and thus apply to more
than one field:
    <CCG HREF="url"
        {{NAME="label" | ID="identifier_code"} &| {LANG="language_code"} &|
        CLASS="class_name"}
        {
            {SET_SEPARATOR} &|
            {INDEX | NOINDEX} &|
            {SHOW | HIDE} &|
            {XPOS="horizontal_position_number"} &|
            {YPOS="vertical_position_number"} &|
            {NEWLINE} &|
            {ALIGN=centre | left | right | justify} &|
            {SIZE=[+/-]1 | 2 | 3 | 4 | 5 | 6 | 7} &|
            {COLOR="#rrggbb" | "colour_name"} &|
            {FACE="type_face_name"} &|
            {BLINK &| BOLD &| UNDERLINE &| ITALIC &| STRIKE} &|
            {SUBSCRIPT | SUPERSCRIPT} &|
            {CLEAR{=left | right | all}} &|
            {NORMAL} &|
            {{CONTACT &| COPYRIGHT &| DEVELOPER} &|
            {PERSONAL &| BUSINESS &| ASSOCIATION} &|
            {attribute_name="attribute_value(s)"}}
        } ...
    >

where: the ellipsis implies optional repetition of the braced ("{" "}") items; the braces are
used to group items and are not CCG syntactic elements; "&" (and) implies items must occur
together; "|" (or) implies only one item must occur; and "&|" (and/or) implies any, including none,
of the items may appear together.



Using the syntax of this example, each CCG phrase is represented as an HTML element, the
element name being "CCG" and the CCG-data (eg attribute_name="attribute_value") and CCG
controls (eg SIZE=+1) are represented as attributes of the HTML element. Some of the
attributes (eg SIZE) have explicit values (eg +1) and some attributes have implied values
depending on their presence or absence in a CCG phrase (eg when the attribute BUSINESS is
present it has the implied value of True and the implied value of False when absent).

Representation in XML syntax requires, at most, only a simple translation. All the items, such
as "NORMAL" and "attribute_name", may remain unchanged as attributes of the element
named "CCG" (eg <CCG size=+1/>). However, when a CCG phrase is encoded in XML, it is
preferred that the items are represented as XML elements. For example, attribute "SIZE=+1"
can be represented as element "<size>+1</size>" or "<size value=+1/>" and "NORMAL" can
be represented as "<normal/>".
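
The translation from attribute form to XML element form might be sketched as below, using Python's standard xml.etree.ElementTree module. The function name to_xml_elements is hypothetical, and the sketch assumes every attribute name is also a legal XML element name; suggested names containing "#" (such as AP#) would first need to be mapped to legal names.

```python
import xml.etree.ElementTree as ET


def to_xml_elements(attrs):
    """Render (name, value) pairs as child elements of a <CCG> element."""
    root = ET.Element("CCG")
    for name, value in attrs:
        child = ET.SubElement(root, name.lower())
        if value is not None:
            child.text = value  # explicit value becomes element content
        # a value of None leaves an empty element, eg the NORMAL flag
    return ET.tostring(root, encoding="unicode")


print(to_xml_elements([("SIZE", "+1"), ("NORMAL", None), ("CN", "ANZSIC")]))
```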

In this example, the attributes ID, LANG and CLASS take their meanings from HTML 3.0. The
"url" in HREF="url" may be a link with or without destination anchor labels. For example, the
URL http://www.w3.org/docs.html does not contain a destination anchor label (or identifier)
while http://www.w3.org/docs.html#searching does contain the destination anchor label
"#searching" which is intended to refer to an anchor in docs.html such as <A
NAME="searching">...</A>. There is some confusion in various HTML standards
documentation about the distinction between the expression NAME="label" and the expression
ID="identifier_code". For most practical purposes the two expressions have the same function
or meaning: to uniquely identify within a document a position in or portion of that document.

Database control attributes:

"Set_separator" indicates the end of association between preceding and following data other
than through the weaker mutual association with the same CCG phrase or web page; the data
are divided into sets. "Index | Noindex" indicates that the following data are / are not to be
indexed by a web crawler. These attributes have an implied attribute value of 'True' if present
in and 'False' when absent from a CCG phrase.
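
One way a web crawler might apply these database controls is sketched below in Python. It is a sketch under stated assumptions: indexing is taken to be on by default, the Index/Noindex state is taken to carry across set separators, and a name repeated within one set keeps only its last value. The function name split_into_sets and the sample data are hypothetical.

```python
def split_into_sets(attrs):
    """Divide a phrase's (name, value) pairs into sets of indexable data."""
    sets, current, indexing = [], {}, True  # assumption: index by default
    for name, value in attrs:
        if name == "SET_SEPARATOR":
            if current:
                sets.append(current)
            current = {}
        elif name == "INDEX":
            indexing = True
        elif name == "NOINDEX":
            indexing = False
        elif indexing:
            current[name] = value  # only indexed data reaches the database
    if current:
        sets.append(current)
    return sets


phrase = [("PNF", "Lee"), ("NOINDEX", None), ("EA", "lee@example.com"),
          ("INDEX", None), ("ACN", "Perth"), ("SET_SEPARATOR", None),
          ("PNF", "Kim")]
print(split_into_sets(phrase))
```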

Display control attributes:

"Show | Hide" indicates that a browser should show / not show the following data. "Xpos" and
"Ypos" indicate the position (for example in pixel or physical units) on the browser screen where
the data is to be displayed. "Newline" may be used in addition or as an alternative method of
placing text on a browser screen. "Align" indicates the positioning of data on a browser screen
relative to the cursor position set by "Xpos", "Ypos" or "Newline". "Size", "Colour" and "Face"
indicate the size, colour and type face or font of the following data when displayed on a
browser screen. "Blink", "Bold", "Underline", "Italic", "Strike", "Superscript" and "Subscript"
indicate that the following data should be displayed blinking, bold, underlined, italicised, struck
through, superscripted or subscripted. "Clear" indicates that the browser screen in the region
where data will be displayed should be cleared to background before displaying the following
data. "Normal" indicates the data is to be displayed without the "Blink" to "Clear"
characteristics. The display controls which consist of an attribute name without an explicit value
have an implied value of 'True' when present and 'False' when absent.
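
The "good until countermanded" behaviour of "Show" and "Hide" can be illustrated with a short Python sketch. It assumes a phrase has already been parsed into (name, value) pairs and that formatting controls are simply skipped rather than applied; the function name visible_values, the show_by_default parameter and the sample data are hypothetical.

```python
# Display controls that carry a value; names follow Example 1.
VALUED_CONTROLS = {"XPOS", "YPOS", "ALIGN", "SIZE", "COLOR", "FACE", "CLEAR"}


def visible_values(attrs, show_by_default=False):
    """Return the data values a browser would display, in document order."""
    showing = show_by_default
    shown = []
    for name, value in attrs:
        if name == "SHOW":
            showing = True   # applies to all following data
        elif name == "HIDE":
            showing = False  # countermands an earlier SHOW
        elif name in VALUED_CONTROLS or value is None:
            continue         # formatting controls and bare flags are not data
        elif showing:
            shown.append(value)
    return shown


phrase = [("PNG", "Ann"), ("SHOW", None), ("SIZE", "+1"), ("PNF", "Lee"),
          ("HIDE", None), ("EA", "ann@example.com")]
print(visible_values(phrase))                        # ['Lee']
print(visible_values(phrase, show_by_default=True))  # ['Ann', 'Lee']
```

The show_by_default flag corresponds to the browser specific default of displaying either none or all of the CCG-data until a display control says otherwise.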

CCG-data attributes:

"Contact &| Copyright &| Developer" indicates that the following CCG-data refers to details for
a person or organisation and/or to the copyright owner and/or to the HTML or web page
developer. "Personal &| Business &| Association" indicates that the following data refers to
details for a person and/or business and/or association. The previous CCG-data attributes
have an implied attribute value of 'True' if present in a CCG phrase or set and 'False' when
absent from a CCG phrase or set. The attribute_name could be standard CCG attribute names
or synonyms of standard CCG attribute names or abbreviations of CCG attribute names which
refer to the following types of CCG attribute values, where square brackets "[" and "]" surround
suggested attribute names:
• industry or service or product or subject classifications and sub-classifications:
    • classification name [CN],
    • classification codes [CC],
• display only text [TEXT],
• contact:
    • person:
        • courtesy title [PNC],
        • first given name [PNG],
        • other given names [PNO],
        • family name [PNF],
        • name suffix [PNS],
        • qualifications [PQ],
        • associations [PA],
        • contact person title [PT],
        • contact person role [PR],
    • organisation:
        • name [ON],
        • unit [OU],
        • identifier [OID],
    • physical or post or delivery address:
        • type [AT] (= "PHYSICAL" &| "POST-OFFICE" &| "POSTAL" &| "DELIVERY"),
        • post office box number [AP#],
        • post office name [APN],
        • room or suite or office or unit or flat or apartment name &| number [AB#],
        • floor name &| number [ABF],
        • building name [ABN],
        • lane or street or road or highway number [AS#],
        • lane or street or road or highway name [ASN],
        • suburb or town or city name [ACN],
        • region or state or territory or province name [ARN],
        • post code [APC],
        • country or nation name [ANN],
    • telephone:
        • type [TT] (= "PREFERRED" &| "VOICE" &| "MOBILE" &| "CAR" &| "MESSAGE"
          &| "PAGER" &| "FACSIMILE" &| "MODEM" &| "ISDN" &| "VIDEO"),
        • nation or country code number [TC#],
        • trunk access number [TT#],
        • area code number [TA#],
        • local number [TL#],
    • email:
        • type [ET] (= "INTERNET" | {other}),
        • mailer [EM],
        • address [EA],
    • Internet address:
        • url [URL],
    • date & time:
        • date & time from [DTF],
        • date & time to [DTT],
        • weekday from [DTWF],
        • weekday to [DTWT],
        • weekday time from [DTWFT],
        • weekday time to [DTWTT],
        • time zone [DTZ],
    • brand name [BN],
    • public key:
        • key type [KT],
        • key [K],
• geographical:
    • location units [GLU],
    • location [GL],
    • serviced region units [GLRU],
    • serviced region [GLR].

Suggested attribute name [CN] is the name of an attribute associated with the attribute value
containing "classification name" type data. For example, the [CN] attribute value could be the
name of a proprietary or national or international or other industry classification standard such
as the Australian and New Zealand Standard Industry Classification or "ANZSIC" for short or
the U.S. Bureau of the Census Industrial Classifications (USBCIC). The associated
classification codes [CC] attribute value could contain the codes and/or descriptions of the
codes of the named standard with or without modifications, deletions or extensions. For
example: CN="ANZSIC" CC="61;Road transport" or CN="USBCIC" CC="581;Hardware store".
Service classifications such as the International Standard Classification of Occupations could
be used. For example: CN="ISCO" CC="4430;Auctioneer". Product classifications such as the
Harmonised Commodity Description And Coding System could be used. For example:
CN="HSC" CC="8411;Turbojets, turbo-propellers & other gas turbines; parts thereof". For
subject classifications, Dewey Decimal, and/or Universal Decimal and/or Library of Congress
and/or Bliss and/or Colon Classification could be used. For example: CN="DDC"
CC="577.699;Sea shore ecology". The inclusion of subject classifications provides a very
simple, straightforward method of classifying the subject matter of an HTML document which
could be attractive to commercially oriented copyright owners.
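
The [CC] examples above pair a code with a description, separated by a semicolon. A minimal Python sketch of how a program might split such a value is given here for illustration only; the function name split_classification is hypothetical, and splitting on the first semicolon is an assumption made so that descriptions which themselves contain semicolons survive intact.

```python
def split_classification(cc_value):
    """Split a CC value of the form 'code;description' into its two parts.

    Assumption: only the FIRST semicolon separates code from description,
    so a description such as '...gas turbines; parts thereof' is kept whole.
    """
    code, _, description = cc_value.partition(";")
    return code.strip(), description.strip()


road = split_classification("61;Road transport")
turbines = split_classification(
    "8411;Turbojets, turbo-propellers & other gas turbines; parts thereof"
)
code_only = split_classification("581")  # a value with no description at all
```

A value with no semicolon yields the whole string as the code and an empty description.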


The text ([TEXT]), person ([PNC] - [PR]), organisation ([ON] - [OID]), physical or post or
delivery address ([AT] - [ANN]), telephone ([TT] - [TL#]), email address ([ET] - [EA]) and
Internet address ([IURL]) are intended to be associated with each other in the obvious manner.
Date & time(s) ([DTF] - [DTZ]) are intended to indicate the times at which the address and/or
telephone and/or email will be serviced by the associated person(s) and/or organisation(s).
The brand name ([BN]) attribute is intended to hold commercial brand names. Public key ([KT]
- [K]) is intended to hold public encryption keys for secure communication with the contact
person or organisation.

The geographical location [GL] could be a latitude and longitude (eg
E148D31'12.5",S36D40'09.6" or E148.5201,S36.6693 or -148.5201,-36.6693), or a Universal
Grid Reference (eg 55FV364402) or other global, national, regional or local location reference
with units as specified [GLU], which is typed in or obtained by pointing to a digitally encoded
map or other methods. In more populated regions of some countries such as the U.S., street
addresses and post codes are associated with a moderately accurate geographic location and
can be used to interpolate geographic location data where geographic location data is not
explicitly stated in the CCG-data. Using a universally recognised code such as latitude and
longitude has advantages when used with international mediums like the Internet.
Geographical location is intended to be associated with a post, delivery address or physical
address such as place of business or residence. A CCG compliant browser could use this
reference to display a map centred on that geographic location. The purpose of the
geographical location data is to allow browser users to specify search engine search criteria
which will result in the search engine selecting only those Internet accessible documents which
provide details about providers which are within a specified region. The serviced region [GLR]
is intended to indicate the preferred area of operation of providers expressed in terms of
serviced region units [GLRU]. A radial distance (eg in kilometres) or alternate means of
expressing an area of interest around a geographic point, such as polygons, are envisaged.

It is envisaged that the CCG attribute_value could be composed of more than one value
(actually sub-value) wherein specific characters or character strings separate individual values.

While specific instances of element names and types have been given in this example, of
more importance is the type of data and type controls over the display and indexing of the
data. As an alternative to the preferred immediately following example where the CCG-data is
lumped together under the HTML element named "CCG", certain elements of the data, for
example the classification data, could be lumped under separate HTML elements with
distinctly different names thereby separating CCG classification data from CCG contact data.
However, this is not preferred because the strength of association between the two types of
data is weakened.


Example 2: Classification of Portion of a Web Page. 

Where it is desired to classify a portion of a web page, such as a paragraph about a product,
simple CCG-data may be used in conjunction with the syntax of Example 1. For example:

<A NAME="Radios">AM-FM radio receivers: </A>
<CCG HREF="#Radios">
CN="ANZSIC"
CC="E23.34.78;Electrical equipment - radio receivers AM"
CC="E23.34.79;Electrical equipment - radio receivers FM"
</CCG>
We won't be beaten on the price of these high quality receivers ....

In this example, the CCG phrase appears after the related anchor (<A NAME=...</A>).
However, while such proximity visually provides an obvious association between the anchor
and related CCG phrase, it is intended that a CCG phrase containing the attribute HREF related
to a specific anchor could appear anywhere within the body of a web page and remain related
to the named anchor. The CCG phrase containing the attribute HREF could appear in a
separate document and thereby relate the CCG-data to the entire document or to a named
anchor although, as previously noted, coordinating separate documents can be problematic. In
the absence of the HREF and NAME attributes, it is also intended that the CCG-data apply to
the whole web page.


Example 3: Classification of Portion of a Web Page using XML Syntax

Using XML syntax and similar attribute names to those of Example 2, the HTML fragment of
Example 2 may be rewritten as:

<A NAME="Radios">AM-FM radio receivers: </A>
<XML>
<CCG>
<HREF>"#Radios"</HREF>
<CN>"ANZSIC"</CN>
<CC>"E23.34.78;Electrical equipment - radio receivers AM"</CC>
<CC>"E23.34.79;Electrical equipment - radio receivers FM"</CC>
</CCG>
</XML>
We won't be beaten on the price of these high quality receivers ....

This example demonstrates that the translation of CCG-data from HTML to XML (and the
reverse) involves simple syntactical and grammatical translations. Of course, the resulting
HTML and XML, while "well formed", might not be recognised or, if recognised, might not be
understood by some parsers.
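
As an illustration of how a parser might read the XML form of a CCG phrase, the following Python sketch extracts the attribute names and values from the <CCG> element of Example 3 using the standard library xml.etree.ElementTree module. The function name read_ccg is hypothetical, and stripping the surrounding double quotes from each value is an assumption about how the quoted values are meant to be interpreted.

```python
import xml.etree.ElementTree as ET

# The <CCG> element of Example 3, reproduced as a standalone fragment.
FRAGMENT = """
<CCG>
<HREF>"#Radios"</HREF>
<CN>"ANZSIC"</CN>
<CC>"E23.34.78;Electrical equipment - radio receivers AM"</CC>
<CC>"E23.34.79;Electrical equipment - radio receivers FM"</CC>
</CCG>
"""


def read_ccg(fragment):
    """Return (attribute_name, attribute_value) pairs in document order."""
    root = ET.fromstring(fragment.strip())
    pairs = []
    for child in root:
        # Assumption: the double quotes are delimiters, not part of the value.
        value = (child.text or "").strip().strip('"')
        pairs.append((child.tag, value))
    return pairs


pairs = read_ccg(FRAGMENT)
```

Returning an ordered list rather than a dictionary keeps repeated attributes such as the two CC values, and preserves the order in which they appear in the phrase.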

Example 4: Constructing a Web Page Containing CCG-data

As an example, a web page developer, Alice Jamieson, is preparing an advertisement for a
local electrician, John Williams, trading as Kelso Electrical, who wants to advertise on the web
for business within 30 kilometres from his office located at 18 Raglan Street, Kelso, New South
Wales. Alice uses a graphical user interface web page authoring tool capable of creating and
modifying web pages containing HTML (and XML) CCG phrases by accepting inputs from a
user. The tool executes on a digital computer having input devices such as a keyboard,
mouse, light pen and touch pad, display devices such as a CRT, LED arrays, liquid crystal
arrays and computer-readable media such as magnetic and optical disks, memory arrays,
magnetic tape and the like.

The authoring tool also embodies knowledge of the content and structure of CCG phrases
such as the attribute names, valid ranges and sets of associated attribute values, the normal
order of the attributes in the CCG phrase and interdependencies between attribute values. The
tool provides a window where web pages may be viewed in layout (browser) mode and
another window where the HTML code may be viewed in editing mode. The tool also provides
means of inserting, deleting, modifying and organising HTML elements, changing font size,
face and colour and so forth. The tool provides means for the user to build CCG phrases by
using input devices to select an edit control representing various types of CCG attributes from
a list which the tool then inserts in the body of a web page together with, when not already
present, HTML code indicative of the start and end of a CCG phrase. The user then types in
the value in the attribute. Similarly, the tool provides means of converting web page text to
CCG attributes. Using input devices, the user selects the text to be converted to a CCG
attribute then selects an edit control from a list; the tool then inserts the HTML code necessary
to encode the text as a CCG attribute. However, these semi-manual methods of creating and
modifying CCG phrases are inefficient and error prone. The tool also provides a button, which
can be activated by using input devices, for access to CCG phrase editing functions. The CCG
editing functions consist of a means of extracting the CCG values from existing CCG phrases
in the web page being edited, forms for entering and modifying the extracted CCG values, a
layout view browser window for altering how the CCG-data displays (position, font size, face,
colour, bold, normal, hiding or showing and so forth), a data view browser window to alter
which CCG-data values are to be indexed or not indexed in search engine databases, and a
means of deleting existing CCG phrases from web pages and inserting new or changed CCG
phrases in web pages. Editing cursors marking the current location at which text and/or data
may be inserted, deleted or modified are provided in each window and form.

In the current example, the web page initially contains no CCG phrase. Clicking the CCG
editing function button of the authoring tool causes a form to appear. The form contains
prompts related to CCG attribute names and associated data input fields related to the CCG
attribute values associated with the CCG attribute names, that is CCG-data. The fields are
blank because, in the web page layout view, the edit cursor is not over a CCG phrase (and
cannot be since the web page initially contains no CCG phrase). The service classifications
relevant to the web page, John Williams' physical business contact address, phone and fax
numbers, email address and geographic location and his post office business contact
addresses are entered into the forms using a keyboard and mouse. The developer, Alice
Jamieson, also includes her basic contact details where provided for on the form. The forms
use drop down lists to select address blocks (eg physical and post office) for editing. Logic
associated with the forms validates the CCG attribute values and interdependencies. Input
devices are then used to control the CCG-data layout view browser to modify the appearance
of the CCG-data such as font size and colour and positioning. In the layout browser, input
devices communicating with the edit cursor are used to highlight individual items and blocks of
items to be changed. The post office address is highlighted as a block and moved into position
in line with the physical address. The CCG-data view window is then used to check which data
items are to be indexed by search engines. In this example all CCG-data (ie all CCG attribute
values except display control values and database control values) are to be indexed. Input
devices are used to control the edit cursor to highlight the entire data and a mouse is used to
click (activate) a button to mark all the data for indexing. Then another button is clicked which
builds an HTML encoded CCG phrase of CCG attributes derived from the CCG-data values,
display control values and database control values and inserts the CCG phrase in the web
page at the location pointed to in the web page layout browser window.


The HTML code editing mode window was called up which revealed the following HTML
encoded CCG phrase in the web page:

<XML>
<CCG>
<INDEX/>
<HIDE/>
<CN>ANZSIC</CN>
<CC>D36.11.45;Electrical contractors - residential</CC>
<CC>D36.11.46;Electrical contractors - industrial</CC>
<SHOW/>
<CONTACT/> <COPYRIGHT/>
<BUSINESS/>
<XPOS>50</XPOS>
<YPOS>320</YPOS>
<ALIGN>centre</ALIGN>
<SIZE>3</SIZE>
<COLOR>black</COLOR>
<FACE>Times New Roman</FACE>
<BOLD/>
<CLEAR>all</CLEAR>
<TEXT>Contact :</TEXT>
<PNC>Mr</PNC>
<PNG>John</PNG>
<PNF>Williams</PNF>
<PQ>AIE</PQ>
<PA>ARUC</PA>
<NEWLINE/>
<PT>Managing Director</PT>
<NEWLINE/>
<ON>Kelso Electrical Pty. Ltd.</ON>
<NEWLINE/>
<NORMAL/> <ITALIC/>
<SIZE>-2</SIZE>
<TEXT>NSW License 45678C</TEXT>
<NEWLINE/>
<NORMAL/> <BOLD/>
<SIZE>+2</SIZE>
<AT>PHYSICAL</AT>
<AS#>18</AS#>
<ASN>Raglan Street</ASN>
<NEWLINE/>
<ACN>Kelso</ACN>
<NEWLINE/>
<ARN>NSW</ARN>
<NEWLINE/>
<HIDE/>
<ANN>Australia</ANN>
<NEWLINE/>
<SHOW/>
<TEXT>Phone:</TEXT>
<TT>PREFERRED ; VOICE ; MESSAGE</TT>
<HIDE/>
<TC#>61</TC#>
<SHOW/>
<TT#>0</TT#>
<TA#>63</TA#>
<TL#>456-7828</TL#>
<TEXT> Fax:</TEXT>
<TT>FACSIMILE</TT>
<HIDE/>
<TC#>61</TC#>
<SHOW/>
<TT#>0</TT#>
<TA#>63</TA#>
<TL#>456-7829</TL#>
<NEWLINE/>
<ET>INTERNET</ET>
<EA>johnw@firefly.com.au</EA>
<TEXT> </TEXT>
<GLU>LatLong</GLU>
<GL>33.3978S;148.5679E</GL>
<GLRU>Km</GLRU>
<GLR>30</GLR>
<SET_SEPARATOR/>
<XPOS>250</XPOS>
<YPOS>320</YPOS>
<NEWLINE/>
<NEWLINE/>
<TEXT>Or write to us at :</TEXT>
<NEWLINE/>
<ON>Kelso Electrical Pty. Ltd.</ON>
<NEWLINE/>
<AT>POST-OFFICE</AT>
<AP#>P.O. Box 187</AP#>
<NEWLINE/>
<APN>Sunny Corner</APN>
<TEXT></TEXT>
<APC>2795</APC>
<NEWLINE/>
<HIDE/>
<ANN>Australia</ANN>
<SET_SEPARATOR/>
<HIDE/>
<DEVELOPER/>
<BUSINESS/>
<PNG>Alice</PNG>
<PNF>Jamieson</PNF>
<ET>INTERNET</ET>
<EA>aljam@firefly.com.au</EA>
<IURL>http://www.firefly.com.au/~aljam/</IURL>
</CCG>
</XML>

In the web page layout browser window the CCG-data displayed as follows:

Contact :                        Or write to us at :

Mr John Williams, AIE, ARUC
Managing Director
Kelso Electrical Pty. Ltd.       Kelso Electrical Pty. Ltd.
NSW License 45678C               P.O. Box 187
18 Raglan Street                 Sunny Corner 2795
Kelso
NSW
Phone: 063-456-7828 Fax: 063-456-7829
Email : johnw@firefly.com.au Map


Having encoded the web page in this way, Alice then posts it on the storage device of a digital
computer connected to the Internet from where it can be retrieved through the Internet using
the URL "http://www.firefly.com.au/~johnw/index.html".

Example 5: Constructing a Database from Web Pages Containing CCG-data

During a routine sweep of Internet connected web page servers, a web crawler (or robot)
operating on a server named "ccg.search.com" executing on an Internet connected digital
computer discovers the URL "http://www.firefly.com.au/~johnw/index.html" in a document it
had previously retrieved through the Internet. The web crawler decides that the URL matches
its selection criteria because the URL contains the suffix ".html". The web crawler then
retrieves the document as follows. It extracts from the URL the address of the computer
hosting the document, then addresses and sends a message (including the address of the web
crawler) requesting the web page through the network to the web page host computer using
TCP/IP protocol. The host computer reads the document, then addresses and sends the
document to the web crawler using TCP/IP protocol. The web crawler waits until it has
received all parts of the web page from the host computer before proceeding. It inspects the
contents of the document and finds that it matches the additional selection criteria that it is an
HTML encoded document. The web crawler program, depending on its state and logic, then
parses the document, strips out and saves some or all of the URLs in the document for future
examination. The web crawler program then passes the document together with the URL of
the document through a network communications channel to an indexing program executing
on a different computer. The indexing computer has database updating software which
manipulates a database stored on computer-readable media.

The indexing program parses the document from first to last character, indexing some of the
meta data in the <head> of the document and the words in the text of the document with
respect to the document URL. In the database of this example, unique words extracted from
the documents already indexed are held in separate rows of a column of a database table and
in another column of the same table on each row is an associated pointer to the first bucket or
block of URLs of documents containing the word associated with the pointer. As new words
are found, the new word is added as a new row in the word column of the table, a new bucket
is created, the URL of the document containing the new word is inserted into the bucket and a
pointer to the new bucket is written in the new row pointer column. When the same word is
found in another document, the row in the table of the word is found, the pointer is retrieved
from the table, the bucket pointed to by the pointer is retrieved and the URL of the other
document is inserted in the bucket. Where a bucket becomes full of URLs, a new bucket is
created and a pointer to the new bucket for holding additional URLs is placed in the full bucket.
Deletion of words and URLs of changed or no longer existing documents is also provided for.
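
The word table and chained buckets described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the database actually used: the class names Bucket and WordIndex, the in-memory dictionary standing in for the word column, and the tiny bucket capacity of 2 are all assumptions chosen to make the overflow behaviour easy to follow. Deletion is omitted.

```python
BUCKET_CAPACITY = 2  # assumption: deliberately tiny so overflow is visible


class Bucket:
    """A block of URLs plus a pointer to the next (overflow) bucket."""

    def __init__(self):
        self.urls = []
        self.next = None


class WordIndex:
    """Maps each unique word to the first bucket of URLs containing it."""

    def __init__(self):
        self.table = {}  # word -> pointer to first bucket

    def add(self, word, url):
        bucket = self.table.get(word)
        if bucket is None:
            # New word: new row, new bucket, pointer written in the row.
            bucket = Bucket()
            self.table[word] = bucket
        # Walk to the last bucket, stopping early if the URL is already held.
        while True:
            if url in bucket.urls:
                return
            if bucket.next is None:
                break
            bucket = bucket.next
        if len(bucket.urls) >= BUCKET_CAPACITY:
            # Full bucket: create a new one and link it from the full bucket.
            bucket.next = Bucket()
            bucket = bucket.next
        bucket.urls.append(url)

    def lookup(self, word):
        urls = []
        bucket = self.table.get(word)
        while bucket is not None:
            urls.extend(bucket.urls)
            bucket = bucket.next
        return urls


index = WordIndex()
for page in ["http://a.example/1", "http://a.example/2", "http://a.example/3"]:
    index.add("electrician", page)
index.add("electrician", "http://a.example/2")  # duplicate, ignored
```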

In addition to indexing words extracted from the text of the document, the indexing program
also indexes the CCG-data in the document as well as indexing words found in the CCG-data.
When the parser finds HTML element "<XML>" in the document it switches into XML parsing
mode and switches out of that mode when "</XML>" is found. When the element "<CCG>" is
found, the parser switches into the CCG parsing mode and switches out of that mode when
"</CCG>" is found.

The example database has a CCG-data attribute name to database property name
correspondence table to show the relationship between the CCG-data attribute names and the
database tables and columns (properties) where the CCG-data attribute values are to be
stored in the database as database property values. The database property values and
associated URLs are stored in much the same way as for words extracted from text as
outlined above. However, CCG contact data, for example, which consists of several distinct
CCG-data attributes which are related (eg street name, city), is stored in a database table
having a column (property) related to each distinct CCG contact attribute name, and each
separate CCG contact data set (eg person's name, address, telephone number) as separated
by "<CCG>", "<SET_SEPARATOR/>" and "</CCG>" is held in a separate row in the table. The
values stored in each row are considered to be a set of associated property values of different
types.

The indexing program, while parsing the document of Example 2 above, encounters the
"<CCG>" element and enters the CCG parsing mode. The parser knows to ignore display
control attributes and to consider database control elements in the CCG phrase. The example
indexing program opts to index all other CCG-data contained in the attribute values until
explicitly instructed not to index the attribute values by encountering the "<NOINDEX/>"
database control element and then to recommence indexing when the "<INDEX/>" database
control element is encountered.
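
The handling of display control, database control and set separation described above might be sketched as follows. The sketch assumes the CCG phrase has already been tokenised into (name, value) pairs; the function name build_rows, the particular membership of the DISPLAY_CONTROL and IMPLIED_FLAGS sets, and the choice to store each property as a list of values are all illustrative assumptions rather than part of the described system.

```python
# Assumption: names treated as display control, taken from the Example 4 listing.
DISPLAY_CONTROL = {
    "XPOS", "YPOS", "ALIGN", "SIZE", "COLOR", "FACE", "BOLD", "ITALIC",
    "NORMAL", "CLEAR", "NEWLINE", "HIDE", "SHOW", "TEXT",
}
# Assumption: attributes whose mere presence implies a value of True.
IMPLIED_FLAGS = {"CONTACT", "COPYRIGHT", "BUSINESS", "DEVELOPER"}


def build_rows(tokens):
    """Group indexed CCG-data into one row (dict) per set of the phrase."""
    rows, row, flags, indexing = [], {}, set(), True

    def flush():
        nonlocal row, flags
        if row:
            for flag in flags:
                row[flag] = True
            rows.append(row)
        # The separator resets implied flags to false for the next set.
        row, flags = {}, set()

    for name, value in tokens:
        if name == "SET_SEPARATOR":
            flush()
        elif name == "NOINDEX":
            indexing = False
        elif name == "INDEX":
            indexing = True
        elif name in DISPLAY_CONTROL:
            continue  # display control attributes are ignored by the indexer
        elif name in IMPLIED_FLAGS:
            flags.add(name)
        elif indexing:
            row.setdefault(name, []).append(value)
    flush()  # the end of the phrase also closes the current set
    return rows


tokens = [
    ("INDEX", None), ("CN", "ANZSIC"), ("CONTACT", None), ("BUSINESS", None),
    ("XPOS", "50"), ("PNG", "John"), ("PNF", "Williams"),
    ("NOINDEX", None), ("PQ", "AIE"), ("INDEX", None),
    ("SET_SEPARATOR", None),
    ("AT", "POST-OFFICE"), ("APC", "2795"),
]
rows = build_rows(tokens)
```

In this token list the PQ value falls between NOINDEX and INDEX and is therefore left out of the first row, and the implied BUSINESS and CONTACT flags do not carry over into the second row.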

Taking each CCG-data attribute name and associated attribute value(s) in succession, the
example indexing program uses the correspondence table to translate the CCG-data attribute
name to the database table and column (property) names where the CCG-data attribute
value(s) are to be stored as database property value(s). The indexing program may opt to
translate the CCG-data attribute values to database property values by, for example,
converting character strings of digits to binary encoded decimal representation, the string
"True" to a single bit representation and the like. The indexing program then adds or updates
the database property value(s), using the database table and column (property) names (or
similar references) obtained by translation, in much the same manner as outlined above for the
update of the database using words extracted from the document text, including associating
the data to the document URL where desired. Where the CCG-data contains a "HREF"
attribute (or similar), the URL associated with the other CCG-data is a URL taken from the
"HREF" attribute value or composed of the document URL and the "HREF" attribute value if
the attribute value is a partial or relative URL. Some CCG attributes, such as "<BUSINESS/>",
have only an implied value of true if the attribute is present and false if the attribute is absent,
the "<SET_SEPARATOR/>", "<CCG>" and "</CCG>" resetting such values to false. However,
where attribute value(s) associated with different attribute names are still related, such as a
person's name and a street name, the related values of different types are stored on the same
row of the same database table but in a different column (database property) to preserve the
relationship. "<SET_SEPARATOR/>" limits the degree of relatedness between, for example, a
person's name occurring before the separator and a street name occurring after the separator.
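
Composing the URL to associate with CCG-data from the document URL and a partial or relative "HREF" value can be illustrated with the Python standard library function urllib.parse.urljoin. The wrapper name ccg_target_url and the file name catalogue.html are hypothetical; the rule that a missing HREF means the CCG-data applies to the whole document follows the description of Example 2.

```python
from urllib.parse import urljoin


def ccg_target_url(document_url, href=None):
    """Return the URL that a CCG phrase's data should be associated with."""
    if not href:
        # No HREF attribute: the CCG-data applies to the whole web page.
        return document_url
    # urljoin leaves absolute URLs alone and resolves relative or
    # fragment-only values against the document URL.
    return urljoin(document_url, href)


page = "http://www.firefly.com.au/~johnw/index.html"
whole_page = ccg_target_url(page)
anchor = ccg_target_url(page, "#Radios")
sibling = ccg_target_url(page, "catalogue.html")
absolute = ccg_target_url(page, "http://www.firefly.com.au/~aljam/")
```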

Using the example document and using the same database column (property) names as used
for the CCG-data attribute names, a portion of the constructed database table would look
like:

PNC   PNG    PNF        PQ    PA     PT                  ...   URL
Mr    John   Williams   AIE   ARUC   Managing Director   ...   (pointer)


Difficulties not highlighted by this example are the need to handle properties having multiple
values of the same type, "sparse rows" where only a few values are not null (blank) and tables
with extremely large numbers of rows. For example, the CCG-data of this example could have
contained multiple values of personal qualifications ("PQ"). To represent this type of data using
a 2 dimensional table database system, the database would be "normalised" so that the
multiple values were stored in a separate table and keys or pointers were used to relate the
items in the two tables. Numerous alternate database systems, for example those
based on key hashing and data buckets, or tagging data values with prefixes or suffixes
related to the type of data value, may be used. Preferably, however, whatever database
system is used, it should preserve the associations of CCG-data items present in the CCG
phrases.

Because the geographic location data was missing from the postal address of the CCG-data in
the example document, but a post code was present, the indexing program inferred the
geographic location from the post code.
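
The normalisation mentioned above for properties with multiple values, such as personal qualifications, can be sketched with the Python standard library sqlite3 module. The table and column names are hypothetical, and the second qualification value "BE" is invented purely to show a multi-valued property; it does not appear in the example document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE contact (
        id  INTEGER PRIMARY KEY,
        png TEXT,
        pnf TEXT,
        url TEXT
    );
    -- Multiple PQ values live in their own table, keyed back to the contact.
    CREATE TABLE qualification (
        contact_id INTEGER REFERENCES contact(id),
        pq         TEXT
    );
    """
)

cur = conn.execute(
    "INSERT INTO contact (png, pnf, url) VALUES (?, ?, ?)",
    ("John", "Williams", "http://www.firefly.com.au/~johnw/index.html"),
)
contact_id = cur.lastrowid

# "BE" is a hypothetical second qualification used only for illustration.
conn.executemany(
    "INSERT INTO qualification (contact_id, pq) VALUES (?, ?)",
    [(contact_id, "AIE"), (contact_id, "BE")],
)

# The key relates the items in the two tables, preserving the association.
joined = conn.execute(
    "SELECT c.pnf, q.pq FROM contact c "
    "JOIN qualification q ON q.contact_id = c.id ORDER BY q.pq"
).fetchall()
```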


Example 6: Finding Web Page References Using a CCG Database

As an example, Kevin Robson lives in Sydney but owns and has rented out a house in
Bathurst. He wants to use the web to find some electricians based in the general Bathurst
region (not only in Bathurst City) to contact for estimating the cost of modifying the wiring in the
house. He uses his web browser to open the web page
"http://www.ausline.com.au/web.search.html" containing AusLine's search engine web page
search criteria input form encoded using the HTML "<form>" element.

The search criteria input form contains several input fields including those labelled "Service
classification", "Key words", "City/Suburb/Town", "Country", "Lat/Long" and "Radius". The form
also displays a button labelled "Map" to allow latitude and longitude to be selected by pointing
to map images. The word "electrician" is typed into the "Service classification" field, "house
wiring" into the "Key words" field, "Bathurst" into the "City/Suburb/Town" field and "10" into the
field "Radius". The country "Australia" was already showing in the country field because the
web page server had received cookie data from the browser indicating that that was the
country used when the browser last used the web page. The "Submit search" button on the
web page was clicked. The browser transmitted a message using TCP/IP protocol to the
AusLine server containing the input field values encoded in the header of the message.

After a short delay, the search result HTML encoded web page was returned. Clicking on the
"Service classification" input field drop down list box to check the classifications used in the
search revealed three items:

• Electrical contractors - residential
• Electrical contractors - industrial
• Electrical engineers

The search engine attached to the server obtained those classifications by using word
stemming and searching the text of the service classifications held in its database. The
Lat/Long field contained the value "33.3856S;148.5743E" which the search engine obtained
by looking up the latitude and longitude of the town "Bathurst" in the country "Australia" in its
database. Clicking on the "Map" button retrieved a web page having the image of a map
centred on the town of Bathurst and showing the area 20 Km around it. The search engine
obtained the map by making a request to another Internet connected server and supplying the
latitude, longitude and radius. Clicking on the browser "Back" button returned to the search
results page.


The search results contained 8 titles, brief descriptions and URLs including a reference
containing the URL "http://www.firefly.com.au/~johnw/index.html". Retrieving each in turn
revealed that all were well focused according to the search criteria, being related to electricians,
electrical contractors and engineers in the Bathurst area. The search engine obtained these
references to web pages by:

• searching its database of service classification titles with words stemming from
  "electrician", which resulted in three service classification codes,
• searching its database using the three service classification codes to obtain an
  intermediate list of URLs of web pages containing those CCG codes,
• searching its database for the two keywords to obtain an intermediate list of URLs of
  web pages containing those words in the web page text,
• searching its database to find the latitude and longitude of Bathurst, Australia,
• searching its database to obtain an intermediate list of web pages which contain
  latitude and longitude data lying within 10 Km of the latitude and longitude of
  Bathurst, Australia,
• producing as a result list, a list of URLs which are common to all the intermediate lists,
• obtaining from its database the title and brief description of the web pages,
• formatting the titles, descriptions and URLs into an HTML encoded report,
• transmitting the report to the enquiring web browser.
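
The intersection of intermediate lists in the steps above might be sketched in Python as follows. This is a sketch under stated assumptions, not the search engine's actual logic: the function names, the in-memory dictionary standing in for the database, the second (Sydney) page and its URL, the use of signed decimal degrees, the requirement that a page contain every keyword, and the use of the haversine great-circle formula for the radius test are all illustrative choices.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def search(db, codes, keywords, centre, radius_km):
    """Return URLs common to the code, keyword and location intermediate lists."""
    # Pages carrying ANY of the classification codes.
    by_code = set()
    for code in codes:
        by_code |= db["codes"].get(code, set())
    # Pages containing ALL of the keywords (assumption).
    by_word = None
    for word in keywords:
        urls = db["words"].get(word, set())
        by_word = urls if by_word is None else by_word & urls
    by_word = by_word or set()
    # Pages whose location lies within the radius of the centre point.
    nearby = {
        url for url, (lat, lon) in db["locations"].items()
        if haversine_km(centre[0], centre[1], lat, lon) <= radius_km
    }
    return sorted(by_code & by_word & nearby)


KELSO = "http://www.firefly.com.au/~johnw/index.html"
SYDNEY = "http://example.com/sydney-electrician.html"  # hypothetical page

DB = {
    "codes": {"D36.11.45": {KELSO, SYDNEY}, "D36.11.46": {KELSO}},
    "words": {"house": {KELSO, SYDNEY}, "wiring": {KELSO, SYDNEY}},
    "locations": {KELSO: (-33.3978, 148.5679), SYDNEY: (-33.8688, 151.2093)},
}

results = search(DB, ["D36.11.45", "D36.11.46"], ["house", "wiring"],
                 centre=(-33.3856, 148.5743), radius_km=10)
```

Only the first page survives all three intersections, because the second page matches the codes and keywords but lies far outside the 10 Km radius.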

Example 7: Finding Contact Details Using a CCG Database 

As an example, Jim Jones of Jones and Sons wants to send a recall notice about a faulty
batch of UV stabilised electrical power cable to all Electrical contractors and Electrical
wholesalers in Australia who have email addresses. He uses his web browser to open the web
page "http://www.ausline.com.au/contact_search.html" containing AusLine's search engine
contact search criteria input form encoded using the HTML "<form>" element.

The search criteria input form contains several input fields including those labelled "Service
classification", "Country" and "Output format". The word "electric*" is typed into the "Service
classification" field, the word "Australia" is typed into the "Country" field and the "Tabular -
Name & Email" option in the "Output format" drop down list box is selected. The "Submit
search" button on the web page is clicked. The browser transmits a message using TCP/IP
protocol to the AusLine server containing the input field values encoded in the header of the
message.

After a short delay, the search result HTML encoded web page is returned. Clicking on the
"Service classification" input field drop down list box to check the classifications used in the
search revealed too many classifications for the result to be sufficiently focused. The following
four classifications were selected from the list:

• Electric cable - ducting systems
• Electrical contractors - residential
• Electrical contractors - industrial
• Electrical wholesalers

and the "Submit search" button is pressed again to refine the search.

The search results contained 3,473 names and associated email addresses and URLs to full
contact details. Jim saved the search result page on his computer so that he could use his
email program to send the recall notice to each email address in the list. The email address
"johnw@firefly.com.au" was included in the list.

The search engine obtained these references to web pages by:

• searching its database using the four service classification titles, which resulted in four
  service classification codes,
• searching its database using the four service classification codes to obtain an
  intermediate list of database primary keys of database table rows containing those
  service classification codes in the database Service classification attribute,
• searching its database using the country name "Australia" to obtain an intermediate
  list of database primary keys of database table rows containing that word in the
  database Country attribute,
• producing as a result list, a list of database primary keys which are common to both
  the intermediate lists,
• obtaining from its database, using the result list, the values of the name and email
  attributes,
• using the HTML <table> element to format the name values, email values and full
  detail URLs into an HTML encoded report,
• transmitting the report to the enquiring web browser.
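
The report formatting step in the list above can be illustrated with a short Python sketch that builds an HTML <table> from name, email and URL values. The function name format_report, the column headings and the sample row are hypothetical; escaping the values with the standard library function html.escape is an added precaution that the description does not mention.

```python
from html import escape


def format_report(rows):
    """Format (name, email, details_url) tuples as an HTML table."""
    lines = ["<table>", "<tr><th>Name</th><th>Email</th><th>Details</th></tr>"]
    for name, email, url in rows:
        lines.append(
            "<tr><td>{}</td><td>{}</td><td><a href=\"{}\">{}</a></td></tr>".format(
                escape(name), escape(email), escape(url), escape(url)
            )
        )
    lines.append("</table>")
    return "\n".join(lines)


# Hypothetical result row used only for illustration.
report = format_report([
    ("Jones & Sons", "jim@example.com", "http://example.com/details?id=1&v=2"),
])
```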

This example relates to finding sets of associated database contact values without requiring
references to web pages. However, finding other sets of associated database values such as
sets of associated industry classification values and geographic location values might also be
useful for some purposes.

Thus it is appreciated that the afore stated goals, advantages and objectives are achieved by
the teachings herein. In particular it is seen that, unlike the prior art, efficiently searchable
Yellow pages and White pages databases and the like may be automatically constructed from
HTML encoded web pages. Additionally the database entries may be automatically linked to
specific web pages and portions of web pages allowing convenient methods of indexing of
product and service catalogues and the like. It is also appreciated that simpler methods of
constructing databases suited to a variety of other uses such as industry and subject
directories are also provided.

From the foregoing teachings and with the knowledge of those skilled in the art, it is apparent
that other modifications and adaptations of the invention will become apparent. For example,
the method steps disclosed and claimed herein may be practiced in a variety of different
orders. CCG-data may take on a variety of different forms within the meaning of the claims.
Thus, it is our intention to include within the scope of the claims not only the invention literally
embraced by the language of the claims but to include all such modifications and adaptations
which may come to those skilled in the art.
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What I claim is: 

1. An HTML encoded web page embodied on a computer-readable medium, said web
page comprising at least one HTML encoded CCG phrase, each CCG phrase
comprising:

a) HTML code indicative of the start of a CCG phrase,

b) at least one CCG-data attribute, and

c) HTML code indicative of the end of a CCG phrase.

2. An HTML encoded web page embodied on a computer-readable medium, said web
page comprising at least one HTML encoded CCG phrase, each CCG phrase
comprising:

a) HTML code indicative of the start of a CCG phrase,

b) at least two CCG-data attributes,

c) at least one database control attribute separating said CCG-data attributes into at
least two sets of CCG attributes, and

d) HTML code indicative of the end of a CCG phrase.

3. An HTML encoded web page embodied on a computer-readable medium, said web
page comprising at least one HTML encoded CCG phrase, each CCG phrase
comprising:

a) HTML code indicative of the start of a CCG phrase,

b) at least one CCG-data attribute,

c) at least one attribute of: database control attributes, display control attributes; and

d) HTML code indicative of the end of a CCG phrase.

4. A computer implemented method of building a web page comprising at least one HTML
encoded CCG phrase, the method comprising the steps of:

a) displaying a web page on a computer display device,

b) displaying an edit cursor indicating a character position on said display device and
a corresponding character position in said web page, said edit cursor being
positionable within the display of said web page by use of computer input devices,

c) separately displaying on said computer display device a set of edit controls
representing CCG-data attribute types,

d) positioning said edit cursor within said display of said web page using said input
devices,

e) selecting an edit control from said set of edit controls using said input devices,

f) relating said selected edit control to a corresponding CCG-data attribute name,

g) constructing a CCG-data attribute character string comprising a character string
representing said attribute name and another character string representing an
empty CCG-data value,

h) if the said edit cursor is positioned outside a CCG phrase,

i) inserting into said web page, at the character position indicated by said edit
cursor, a start character string comprising HTML code indicative of the start
of a CCG phrase,

ii) inserting into said web page, immediately after the end of said start
character string, an end character string comprising HTML code indicative of
the end of a CCG phrase, and

iii) positioning said edit cursor between said start and end character strings,
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i) inserting said CCG-data attribute character string into said web page at the 

character position indicated by said edit cursor,

j) positioning said edit cursor at the character position in said web page of the CCG-data
value of said inserted CCG-data attribute character string,

k) inputting characters using a keyboard,

l) inserting said input characters into said web page at the character position
indicated by said edit cursor, thereby converting said empty CCG-data value to a
non-empty CCG-data value, and

m) writing said web page on computer-readable media. 

A computer implemented method of building a web page comprising at least one HTML 
encoded CCG phrase, the method comprising the steps of: 

a) displaying a web page on a computer display device, 

b) displaying a start edit cursor and an end edit cursor on said display device, each 
said edit cursors indicating a character position on said display device and a 
corresponding character position in said web page, said edit cursors being 
positionable within the display of said web page by use of computer input devices, 

c) separately displaying on said computer display device a set of edit controls 
representing CCG-data attribute types, 

d) selecting a string of web page characters on said display device using said input
devices to position said start edit cursor to indicate the start of said string of web
page characters and said end edit cursor to indicate the end of said string of web
page characters,

e) selecting an edit control from said set of edit controls using said input devices, 

f) relating said selected CCG-data control to a corresponding CCG-data attribute 
name, 

g) constructing a CCG-data attribute character string comprising a character string
representing said attribute name and another character string representing a CCG-data
value containing said string of web page characters,

h) deleting said string of web page characters from said web page,

i) if the said start edit cursor is positioned outside a CCG phrase,

i) inserting into said web page, at the character position indicated by said start 
edit cursor, a start character string comprising HTML code indicative of the 
start of a CCG phrase, 

ii) inserting into said web page, immediately after the end of said start 
character string, an end character string comprising HTML code indicative of 
the end of a CCG phrase, and 

iii) positioning said start edit cursor between said start and end character 
strings, 

j) inserting said CCG-data attribute character string into said web page at the
character position indicated by said start edit cursor, thereby converting said string
of web page characters to a CCG-data attribute value contained within a CCG-data
attribute contained within a CCG phrase, and

k) writing said web page on computer-readable media. 

A computer implemented method of building a web page comprising at least one HTML 
encoded CCG phrase, the method comprising the steps of: 

a) displaying a CCG-data input form on a computer display device, 

b) inputting CCG-data values into fields of said data input form using computer input
devices,

c) inserting into the body of a web page a start character string comprising HTML
[remainder of step c) and steps d) to f) illegible in the source; the legible fragments
refer to said data input form and to relating a value to a corresponding CCG-data
attribute name],

g) [illegible] a CCG-data attribute character string comprising a character string
[illegible] attribute name and another character string [illegible],

h) [illegible] said CCG-data attribute character string into said web page between said
start and end character strings,
i) writing said web page on computer-readable media. 

A computer implemented method of [illegible]
associated property values wherein each set [illegible; property values of]
different types, the property values being any of classification values, contact values,
geographic location values, hereinafter collectively referred to as CCG-data, the
method comprising the steps of:

a) retrieving successive web pages from a computer network, each web page being
[illegible],

b) searching each web page for a CCG phrase that includes a plurality of different
types of CCG-data attributes,

c) extracting a plurality of said attributes from said phrase,

d) from each extracted attribute, deriving an attribute name and a related attribute
value,

e) determining the type of said extracted attribute and said attribute value by
[remainder of the claim illegible in the source, ending:]
in said set of associated property values.

A computer implemented method of building a [illegible]
associated property values wherein each set includes at least two property values of
different types, the property values being any of classification values, contact values,
geographic location values, hereinafter collectively referred to as CCG-data, the method
comprising the steps of:

a) retrieving successive web pages from a computer network, each web page being
[illegible],

b) searching each web page for a CCG phrase that includes at least one type of
CCG-data attribute,

c) extracting at least one said attribute from said phrase,

d) from each extracted attribute, deriving an attribute name and a related attribute
value,

e) determining the type of said extracted attribute and said attribute value by
reference to said attribute name,

f) relating said type of attribute value so determined to a corresponding type of
database property value,

g) relating the URL of said web page to another type of database property value,

h) writing said derived attribute value to the database property value of said
determined corresponding type in a set of associated property values, and

i) writing the URL of said web page to a database property value of said other type
in said set of associated property values.

A computer implemented method of building a database which comprises sets of
associated property values wherein each set includes at least two property values of
different types, the property values being any of classification values, contact values,
geographic location values, hereinafter collectively referred to as CCG-data, the method
comprising the steps of:

a) retrieving successive web pages from a computer network,

b) searching each web page for a CCG phrase that includes a plurality of different
types of CCG-data attributes,

c) extracting a plurality of said attributes from said phrase,

d) from each extracted attribute, deriving an attribute name and a related attribute
value,

e) determining the type of said extracted attribute and said attribute value by
reference to said attribute name,

f) relating said type of attribute value so determined to a corresponding type of
database property value, and

g) writing said derived attribute value to the database property value of said
determined corresponding type in a set of associated property values.
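
The database-building steps of the preceding claim can be illustrated with a short sketch. It is not the claimed implementation: this section does not give the HTML code that marks the start and end of a CCG phrase, so the `<ccg>` and `</ccg>` delimiters, the `name="value"` attribute syntax and the attribute names in `ATTRIBUTE_TYPES` are all hypothetical choices made only for the example.

```python
import re

# Hypothetical delimiters for a CCG phrase (assumption, not from the source).
PHRASE_RE = re.compile(r"<ccg>(.*?)</ccg>", re.DOTALL)
# Hypothetical attribute syntax inside a phrase: name="value".
ATTRIBUTE_RE = re.compile(r'([\w-]+)="([^"]*)"')

# Hypothetical mapping from attribute name to database property value type.
ATTRIBUTE_TYPES = {
    "service": "classification",
    "name": "contact",
    "email": "contact",
    "country": "geographic",
}


def build_property_sets(pages):
    """Return one set of associated property values per qualifying CCG phrase."""
    database = []
    for page in pages:                                   # a) successive web pages
        for phrase in PHRASE_RE.findall(page):           # b) search for CCG phrases
            record = {}
            for name, value in ATTRIBUTE_RE.findall(phrase):  # c), d) name and value
                value_type = ATTRIBUTE_TYPES.get(name)   # e), f) type from the name
                if value_type is None:
                    continue                             # not a CCG-data attribute
                record.setdefault(value_type, []).append(value)  # g) write the value
            if len(record) >= 2:                         # at least two different types
                database.append(record)
    return database
```

Recording the URL of each page alongside the extracted values, as in the claim before this one, would only need one more key in `record`.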

A computer implemented method of finding references to web pages posted on a
computer network, the method using a database comprising sets of associated property
values, the property values being any of classification values, contact values, geographic
location values, hereinafter collectively referred to as CCG-data, and URL references,
the method comprising the steps of:

a) receiving a query phrase including query relational expressions from a computer
network,

b) parsing said query phrase and extracting each of said query relational expressions
included therein,

c) from each extracted query relational expression, deriving a query field name,

d) determining the type of said query relational expression by reference to its derived
query field name,

e) relating said type of query relational expression so determined to one of the
following query relational expression types: CCG-data type, other type,

f) provided said query relational expression is a CCG-data type, deriving a query
relational operator and query value related to its query field name from said query
relational expression,

g) determining the type of said query value by reference to said query field name,

h) relating said type of query value so determined to a corresponding type of 
database property value, 
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i) locating database property values of said determined corresponding type which 
return a true value when tested against said query value using said query 
relational operator, 

j) extracting from said database a list of the URL references associated with the so 
located database property values, 

A computer implemented method of finding sets of associated database property values 
the method using a database comprising sets of associated property values wherein 
each set includes at least two property values of different types, the property values 
being any of classification values, contact values, geographic values, hereinafter 
collectively referred to as CCG-data, the method comprising the steps of: 

a) receiving a query phrase including query relational expressions from a computer 
network, 

b) parsing said query phrase and extracting each of said query relational expressions 
included therein, 

c) from each extracted query relational expression, deriving a query field name, 

d) determining the type of said query relational expression by reference to its derived 
query field name, 

e) relating said type of query relational expression so determined to one of the 
following query relational expression types: CCG-data type, other type, 

f) provided said query relational expression is a CCG-data type, deriving a query
relational operator and query value related to its query field name from said query
relational expression,

g) determining the type of said query value by reference to said query field name, 

h) relating said type of query value so determined to a corresponding type of 
database property value, 

i) locating database property values of said determined corresponding type which 
return a true value when tested against said query value using said query 
relational operator, 

j) extracting from said database sets of associated database property values 
associated with the so located database property values. 
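
The query steps of the preceding claim can be sketched as below. This is a hedged illustration only: the `field=value` expression syntax, the `&` separator, the two supported operators, the field names in `CCG_FIELDS`, and the choice to combine expressions with a logical AND are assumptions not stated in the claim.

```python
import operator
import re

# Hypothetical expression syntax: <field><operator><value>, e.g. "country=Australia".
EXPRESSION_RE = re.compile(r"\s*(\w+)\s*(!=|=)\s*(.+?)\s*$")
OPERATORS = {"=": operator.eq, "!=": operator.ne}

# Hypothetical query field names treated as CCG-data type; any other name is "other type".
CCG_FIELDS = {"service": "classification", "email": "contact", "country": "geographic"}


def find_sets(database, query_phrase):
    """Return the sets of associated property values matching every CCG-data expression."""
    results = list(database)
    for expression in query_phrase.split("&"):       # b) extract each expression
        match = EXPRESSION_RE.match(expression)
        if match is None:
            continue
        field, op, value = match.groups()            # c), f) field name, operator, value
        if field not in CCG_FIELDS:                  # d), e) "other type": skipped here
            continue
        test = OPERATORS[op]
        # i) keep the sets whose property value tests true against the query value
        results = [s for s in results if field in s and test(s[field], value)]
    return results                                   # j) the associated sets
```

For example, `find_sets(db, "service=plumbing&country=Australia")` would return only the sets in `db` whose `service` and `country` values both match.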

A method of displaying a web page comprising at least one HTML encoded CCG 
phrase, the method comprising the steps of: 

a) retrieving a web page from a computer network, 

b) parsing said retrieved web page to locate an HTML code indicative of the start of a 
CCG phrase, 

c) parsing said located CCG phrase and extracting successive CCG attributes 
contained therein until an HTML code indicative of the end of said CCG phrase is 
found, 

d) from each extracted attribute, deriving an attribute name, 

e) determining the type of said extracted attribute by reference to its derived attribute 
name, 

f) relating said type of attribute so determined to one of the following attribute types: 
database control, display control, CCG-data, 

g) provided said extracted attribute is not a database control type, deriving an 
attribute value related to its attribute name from said extracted attribute, 

h) determining the type of said attribute value by reference to said attribute name,

i) relating said type of attribute value so determined to a corresponding type of
parameter of a display-device-control-program,
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j) writing said attribute value to said parameter, and 

k) where said type of attribute is a CCG-data type, causing said display-device-control-program
to effect display of said attribute value on a display device,
formatted and positioned according to said display-device-control-program
parameters whereby successive values of CCG-data of the CCG phrase are
displayed.
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ABSTRACT , . , 

A system for automatically creating databases containing industry, service, product and
subject classification data, contact data, geographic location data (CCG-data) and links to web
pages from HTML, XML or SGML encoded web pages posted on computer networks such as
the Internet or Intranets. The web pages containing HTML, XML or SGML encoded CCG-data,
database update controls and web browser display controls are created and modified by using
simple text editors, HTML, XML or SGML editors or purpose built editors. The CCG databases
may be searched for references (URLs) to web pages by use of enquiries which reference one
or more of the items of the CCG-data. Alternatively, enquiries referencing the CCG-data in the
databases may supply contact data without web page references. Data duplication and
coordination is reduced by including in the web page CCG-data display controls which are
used by web browsers to format for display the same data that is used to automatically update
the databases.






FILED BY IDS 



WORLD INTELLECTUAL PROPERTY ORGANIZATION 
International Bureau 




PCT 

INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT) 



(51) International Patent Classification 6: G06F 17/40

(11) International Publication Number: WO 98/03928 A1

(43) International Publication Date: 29 January 1998 (29.01.98)

(21) International Application Number: PCT/US97/12628

(22) International Filing Date: 18 July 1997 (18.07.97)

(30) Priority Data: 08/685,025, 23 July 1996 (23.07.96), US

(71) Applicant: LEXTRON SYSTEMS, INC. [US/US]; 20264 Ljepava Drive, Saratoga, CA 94401 (US).

(72) Inventor: KIKINIS, Dan; 20264 Ljepava Drive, Saratoga, CA 94401 (US).

(74) Agent: BOYS, Donald, R.; P.O. Box 187, Aromas, CA 95004 (US).

(81) Designated States: CN, JP, European patent (AT, BE, CH, DE,
DK, ES, FI, FR, GB, GR, IE, IT, LU, MC, NL, PT, SE).

Published
With international search report.



(54) Title: INTEGRATED SERVICES ON INTRANET AND INTERNET 
(57) Abstract 

A web server system for delivering e-mail messages and other forms of digital documents
converts incoming documents into Hypertext Markup Language (204) and stores them in an
indexed database comprising directories and subdirectories. As requests are received (223)
from users, HTML documents are retrieved from the directories and transmitted directly to
the users with no need for conversion to another format. In a preferred embodiment
directories are assigned to users and a user accesses a WEB page on the server to access
digital documents. Attachments in this embodiment are related by hyperlinks.







FOR THE PURPOSES OF INFORMATION ONLY 



Codes used to identify States party to the PCT on the front pages of pamphlets publishing international applications under the PCT. 



AL  Albania                     ES  Spain                                    LS  Lesotho                                        SI  Slovenia
AM  Armenia                     FI  Finland                                  LT  Lithuania                                      SK  Slovakia
AT  Austria                     FR  France                                   LU  Luxembourg                                     SN  Senegal
AU  Australia                   GA  Gabon                                    LV  Latvia                                         SZ  Swaziland
AZ  Azerbaijan                  GB  United Kingdom                           MC  Monaco                                         TD  Chad
BA  Bosnia and Herzegovina      GE  Georgia                                  MD  Republic of Moldova                            TG  Togo
BB  Barbados                    GH  Ghana                                    MG  Madagascar                                     TJ  Tajikistan
BE  Belgium                     GN  Guinea                                   MK  The former Yugoslav Republic of Macedonia      TM  Turkmenistan
BF  Burkina Faso                GR  Greece                                   ML  Mali                                           TR  Turkey
BG  Bulgaria                    HU  Hungary                                  MN  Mongolia                                       TT  Trinidad and Tobago
BJ  Benin                       IE  Ireland                                  MR  Mauritania                                     UA  Ukraine
BR  Brazil                      IL  Israel                                   MW  Malawi                                         UG  Uganda
BY  Belarus                     IS  Iceland                                  MX  Mexico                                         US  United States of America
CA  Canada                      IT  Italy                                    NE  Niger                                          UZ  Uzbekistan
CF  Central African Republic    JP  Japan                                    NL  Netherlands                                    VN  Viet Nam
CG  Congo                       KE  Kenya                                    NO  Norway                                         YU  Yugoslavia
CH  Switzerland                 KG  Kyrgyzstan                               NZ  New Zealand                                    ZW  Zimbabwe
CI  Côte d'Ivoire               KP  Democratic People's Republic of Korea    PL  Poland
CM  Cameroon                    KR  Republic of Korea                        PT  Portugal
CN  China                       KZ  Kazakstan                                RO  Romania
CU  Cuba                        LC  Saint Lucia                              RU  Russian Federation
CZ  Czech Republic              LI  Liechtenstein                            SD  Sudan
DE  Germany                     LK  Sri Lanka                                SE  Sweden
DK  Denmark                     LR  Liberia                                  SG  Singapore
EE  Estonia







WO 98/03928 PCT/US97/12628

Integrated Services on IntraNet and Internet 



Field of the Invention 

The present invention is in the area of multimedia document handling and
cross-media access of such documents based on Internet, Intranet and Telephony
networks.

Cross Reference to Related Applications

This disclosure is related to patent application 08/629,475 by the same inventor. 



Background of the Invention

Today many different electronic services are available for communication.
Such services include, but are not necessarily limited to, voice-mail, e-mail, paging,
alpha paging, cellular phones, paging phones, fax machines and so forth. There are
also cross-linked services available, such as paging on digital cell phones, and the
like. In general, however, each type of media is limited to one access, usually in a very
primitive manner.

Recently Motorola announced e-mail on cellular telephones. To use this
service, a user calls a special number, and the saved e-mail is read over the phone to
the user. Such a service may be helpful in some cases, but not very helpful in
others. If, for example, a spreadsheet is attached, the spreadsheet cannot be read over
the phone. Even if a spreadsheet could be converted, reading potentially hundreds of
numbers over the phone will most likely result in several transcription errors,
rendering the result basically useless.

What is needed are better devices and better methods, crossing traditional media
boundaries.
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2 

One simple way to offer integrated services is to use a database on a server
that is connected to the World Wide Web (WWW). Then, when data is requested, that
data is called by invoking a so-called CGI application (these are applications that are
launched by a web page). The CGI application then sorts out data, and presents the
result as a dynamically-built web page. During a Comdex show on about November
14, 1995 Lotus, Inc. announced such a program for their Notes product. This addition
allows users to read Notes. Others have followed since.

The problem remaining with such solutions is that most of them are
proprietary and also slow, meaning that only a very limited number of users can be
serviced concurrently by one server. This is partly because a CGI application has to be
launched for every request, invoking a database inquiry, which in all cases consumes
substantial computer power and time.

Summary of the Invention

In a preferred embodiment of the present invention a web-server system for
processing and providing digital documents is provided, comprising: a receiver-converter
for receiving digital documents and converting the digital documents to Hyper Text
Markup Language (HTML) format; a directory structure providing a database; and an
index listing the contents of the database by directory structure. Upon receipt of a
digital document, the receiver-converter converts the digital document into HTML
format and stores it in the directory structure, and updates the index. In some
embodiments the system further comprises an access program wherein database
directories are assigned to individual users and displayed as web pages. In this
embodiment attachments to incoming e-mail are related to stored mail as hyperlinks
to the web page.

A method is provided comprising steps of (a) making a database on a server
composed of directories assigned to users; (b) receiving digital documents at the
server; (c) converting the digital documents to Hypertext Markup Language (HTML)
format; and (d) storing the HTML digital documents in the directories. In some
embodiments the method further comprises steps for: (e) receiving a request from a
user; (f) retrieving a document from the database in HTML format; and (g)
transmitting the document to the user over the Internet.

In various embodiments, assuming servers of relatively equal computing 
power, by using a directory structure instead of an integrated database, and pre- 
converting documents to HTML format prior to storing for later retrieval by a user, 
more users can be served at a faster pace than can be served in conventional systems. 
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Brief Description of the Drawing Figures 

Fig. 1 is a generalized topology example showing arrangement and 
connectivity of equipment in an embodiment of the present invention. 

Fig. 2 is a flowchart illustrating processes and operations in practicing 
embodiments of the present invention. 



Description of the Preferred Embodiments 

The present invention in various embodiments differs from the prior art in that
a database is not used, as described above in the Background section. When a
database is used with the Internet (WWW), once data is extracted, it must be
converted to HTML (Hyper Text Markup Language) before the data can be
transmitted on the Internet. This is typically done as a function of the CGI application
called. Instead, in embodiments of the present invention, the digital documents
(mails) are stored as HTML files in a directory structure representing the database. In
addition, in some embodiments, even the index is kept in an HTML file, and the index
is continuously updated as messages come in.

In an alternative embodiment small downloadable modules, in technologies
such as JAVA or similar, are provided on a server connected to the WWW. A user
first downloads the HTML index and a small application to handle it, then executes
actual index searches on the user machine. Once a file or files are located in the index
a request is sent over the Internet to access the file or files from the server.

In one embodiment the existing "Send Mail" of the UNIX operating system on
a server is modified in a way that incoming mail is converted into HTML files, and
then stored in appropriate pre-arranged directories. An index file is then updated. If
the incoming mail has attachments, they are stored in the same directory and can have
a hyperlink from the mail page. A user may then either view or download the
attachment(s).
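
The conversion described above (incoming mail converted to an HTML file, stored in a pre-arranged directory, attachments saved alongside and hyperlinked, and an index file updated) might look roughly like the following sketch. It does not modify any actual mail transfer agent; the function name `store_mail_as_html`, the `mail_id` parameter, the file naming and the append-only `index.html` layout are all assumptions made for illustration.

```python
from email import message_from_string
from html import escape
from pathlib import Path


def store_mail_as_html(raw_mail: str, user_dir: Path, mail_id: str) -> Path:
    """Convert one raw mail message to an HTML page in the user's directory."""
    message = message_from_string(raw_mail)
    user_dir.mkdir(parents=True, exist_ok=True)

    body_parts = []
    links = []
    for part in message.walk():
        if part.is_multipart():
            continue  # containers hold no content of their own
        payload = part.get_payload(decode=True) or b""
        filename = part.get_filename()
        if filename:
            # Attachment: store it in the same directory and link it from the mail page.
            safe_name = "{}-{}".format(mail_id, Path(filename).name)
            (user_dir / safe_name).write_bytes(payload)
            links.append('<a href="{}">{}</a>'.format(escape(safe_name), escape(filename)))
        elif part.get_content_type() == "text/plain":
            charset = part.get_content_charset() or "utf-8"
            body_parts.append(payload.decode(charset, errors="replace"))

    subject = message.get("Subject", "(no subject)")
    page = "<html><head><title>{0}</title></head><body><h1>{0}</h1><pre>{1}</pre>{2}</body></html>".format(
        escape(subject), escape("".join(body_parts)), "<br>".join(links)
    )
    mail_page = user_dir / "{}.html".format(mail_id)
    mail_page.write_text(page, encoding="utf-8")

    # Update the index, itself kept as an HTML file (one link per stored mail).
    with (user_dir / "index.html").open("a", encoding="utf-8") as index:
        index.write('<p><a href="{}.html">{}</a></p>\n'.format(escape(mail_id), escape(subject)))
    return mail_page
```

Because each page is written once at arrival time, a later request can be served as a plain file read, which is the performance argument the description makes against per-request CGI conversion.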

Additional functions, such as address book, sending mail etc. are provided in
embodiments of the invention by using a Java applet having a relatively small user
interface. The applet can directly access files containing addresses and insert them
into messages and so forth.

In some embodiments of the invention, to facilitate adding of addresses, the
addresses can be marked as well on the message, and by clicking on the addresses a
user can cause the addresses to be transferred into the list. The address list also
contains, in some embodiments, phone numbers and addresses (snail mail), that can
be launched into other applications such as a phone dialer. This feature is very
attractive in conjunction with such things as voice-mail, video-mail etc. Players and
auxiliary tools may be launched to connect a user with a calling party, or to allow a
user to leave voice-mail and/or video-mail messages.

On the receiving end, where a user is using a system according to an
embodiment of the present invention, voice-mail and video-mail are converted when
received into one or several 'standard' formats, so that when the user wants to view it,
no long delays are incurred. Without this feature a user may have to launch a
CGI-controlled search through a database, followed by on-the-fly conversion, which can
consume a substantial amount of CPU power.

In embodiments of the present invention all files are prepared when arriving,
such that the user, when checking, can just browse. By using an HTTPS server,
security is provided by standards already established on the Internet. This feature
allows more users on a single server, which ultimately reduces costs dramatically.

In an ideal setup, the user can go to a web-page, and open his own account all
by himself, since only name, password and credit card (or some other form of
payment) are needed. There are no IP addresses etc. to worry about. Additionally, the
user may also open up his own web page much like the same web-page referred to
above, and then upload through a secure HTTPS transaction new pages that he created
on his own system.

It will be apparent to those with skill in the art that there are many alterations
that might be made in the embodiments described without departing from the spirit
and scope of the invention. For example, there are many ways directory structures
may be provided and many ways individual programmers might furnish code to
accomplish the modules of the invention. There are similarly many sorts of platforms
and data links that may be used in practicing embodiments of the invention. The
invention is limited in scope only by the claims which follow.
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What is claimed is: 

1. A web-server system for processing and providing digital documents, comprising:
a receiver-converter for receiving digital documents and converting the digital
documents to Hyper Text Markup Language (HTML) format;
a directory structure providing a database; and
an index listing the contents of the database by directory structure;
wherein, upon receipt of a digital document, the receiver-converter converts
the digital document into HTML format and stores it in the directory structure, and
updates the index.

2. A web-server system as in claim 1 further comprising an access program wherein
database directories are assigned to individual users and displayed as web pages.

3. A web-server system as in claim 1 wherein attachments in incoming e-mail are
related to stored mail as hyperlinks to the web page.

4. A method for providing integrated digital document services to users, comprising
steps of:

(a) making a database on a server composed of directories assigned to users;

(b) receiving digital documents at the server;

(c) converting the digital documents to Hypertext Markup Language (HTML)
format; and

(d) storing the HTML digital documents in the directories.

5. The method of claim 4 further comprising steps for:

(e) receiving a request from a user;

(f) retrieving a document from the database in HTML format; and

(g) transmitting the document to the user over the Internet.



[Drawing sheet 1/2, SUBSTITUTE SHEET (RULE 26): figure content not recoverable from the scan]

[Drawing sheet 2/2: figure content not recoverable from the scan]




INTERNATIONAL SEARCH REPORT 



International application No.
PCT/US97/12628



A. CLASSIFICATION OF SUBJECT MATTER 
IPC(6) :G06F 17/40 

US CL :395/774 

According to International Patent Classification (IPC) or to both national classification and IPC

B. FIELDS SEARCHED 

Minimum documentation searched (classification system followed by classification symbols) 

U.S. : 395/774, 762, 610; 358/402, 403



Documentation searched other than minimum documentation to the extent that such documents are included in the fields searched 



Electronic data base consulted during the international search (name of data base and, where practicable, search terms used) 
IEEE CD-ROM, Computer Select 1995-1996 CD-ROM, APS 



C. DOCUMENTS CONSIDERED TO BE RELEVANT 



Category* 


Citation of document, with indication, where appropriate, of the relevant passages 


Relevant to claim No. 


X.P 


US 5,649,186 A (FERGUSON) 15 JULY 1997 (15.7.97) SEE 
ABSTRACT 


1-5 


Y,E 


US 5,654,886 A (ZERESKI ET AL.) 05 AUGUST 1997 
(5.8.97) SEE ABSTRACT 


1-5 


Y.P 


US 5.627,997 A (PEARSON ET AL.) 06 MAY 1997 (6.5.97) 
SEE ABSTRACT 


1-5 


Y,P 


US 5,623,589 A (NEEDHAM ET AL.) 22 APRIL 1997 
(22.4.97) SEE ABSTRACT 


1-5 


Y,P 


US 5,608,874 A (OGAWA ET AL.) 04 MARCH 1997 
(4.3.97) SEE ABSTRACT 


1-5 


Y,P 


US 5,608,446 A (CARR ET AL.) 04 MARCH 1997 (4.3.97) 
SEE ABSTRACT 


1-5 



[X] Further documents are listed in the continuation of Box C. [ ] See patent family annex. 

* Special categories of cited documents: 

"A" document defining the general state of the art which is not considered 
    to be of particular relevance 
"E" earlier document published on or after the international filing date 
"L" document which may throw doubts on priority claim(s) or which is 
    cited to establish the publication date of another citation or other 
    special reason (as specified) 
"O" document referring to an oral disclosure, use, exhibition or other 
    means 
"P" document published prior to the international filing date but later than 
    the priority date claimed 
"T" later document published after the international filing date or priority 
    date and not in conflict with the application but cited to understand the 
    principle or theory underlying the invention 
"X" document of particular relevance; the claimed invention cannot be 
    considered novel or cannot be considered to involve an inventive step 
    when the document is taken alone 
"Y" document of particular relevance; the claimed invention cannot be 
    considered to involve an inventive step when the document is 
    combined with one or more other such documents, such combination 
    being obvious to a person skilled in the art 
"&" document member of the same patent family 



Date of the actual completion of the international search 
26 AUGUST 1997 

Date of mailing of the international search report 
09 OCT 1997 

Name and mailing address of the ISA/US 
Commissioner of Patents and Trademarks 
Box PCT 
Washington, D.C. 20231 
Facsimile No. (703) 305-3230 

Authorized officer 
Telephone No. (703) 305-8449 



Form PCT/ISA/210 (second sheet)(July 1992)* 



INTERNATIONAL SEARCH REPORT 



International application No. 
PCT/US97/12628 



C (Continuation). DOCUMENTS CONSIDERED TO BE RELEVANT 


Category* 


Citation of document, with indication, where appropriate, of the relevant passages 


Relevant to claim No. 


Y,P 


US 5,577,108 A (MANKOVITZ) 19 NOVEMBER 1996 
(19.11.96) SEE ABSTRACT 


1-5 


Y,P 


US 5,553,281 A (BROWN ET AL.) 03 SEPTEMBER 1996 
(3.9.96) SEE ABSTRACT 


1-5 


Y 


US 5,530,852 A (MESKE JR. ET AL.) 25 JUNE 1996 (25.6.96) 
SEE ABSTRACT 


1-5 


Y 


US 5,406,557 A (BAUDOIN) 11 APRIL 1995 (11.4.95) SEE 
ABSTRACT 


1-5 


Y 


KIKUCHI ET AL., USER INTERFACE FOR A DIGITAL 
LIBRARY TO SUPPORT CONSTRUCTION OF A VIRTUAL 
PERSONAL LIBRARY, PROCEEDINGS OF THE 
INTERNATIONAL CONFERENCE ON MULTIMEDIA 
COMPUTING AND SYSTEMS, 17 JUNE 1996, P. 429-432 , 
SEE P.429 


1-5 


Y 


WILKINSON, HARMONIC CONVERGENCE, PC WEEK, 11 
MARCH 1996, V.13, N.10, P.15-16, SEE P.15 


1-5 


Y 


CROTTY, NETSCAPE NAVIGATOR SHATTERS STATIC 
WEB PAGES, MACWORLD, DECEMBER 1995, V.12, N.12, 
P.34-35, SEE P. 34 


1-5 


Y 


SEMILOF, PROTOTYPE E-MAIL WARE DRAWS MIXED 
REVIEWS, COMMUNICATIONSWEEK, 17 APRIL 1995, 
N.553, P. 15, SEE P. 15 


1-5 



Form PCT/ISA/210 (continuation of second sheet)(July 1992)* 



XP-002193377 



FILED BY IDS 







Extracting Entity Profiles from Semistructured Information Spaces 



Robert A. Nado Scott B. Huffman 

Price Waterhouse Technology Centre 
68 Willow Road 
Menlo Park, CA 94025-3669 
{nado, huffman}@tc.pw.com 



Abstract 

A semistructured information space consists of 
multiple collections of textual documents containing 
fielded or tagged sections. The space can be highly 
heterogeneous, because each collection has its own 
schema, and there are no enforced keys or formats for 
data items across collections. Thus, structured 
methods like SQL cannot be easily employed, and 
users often must make do with only full-text search. In 
this paper, we describe an approach that provides 
structured querying for particular types of entities, 
such as companies and people. Entity-based retrieval 
is enabled by normalizing entity references in a 
heuristic, type-dependent manner. The approach can 
be used to retrieve documents and can also be used to 
construct entity profiles - summaries of commonly 
sought information about an entity based on the 
documents' content. The approach requires only a 
modest amount of meta-information about the source 
collections, much of which is derived automatically. 

1 Introduction 

Decentralized information sharing architectures like 
the World Wide Web and Lotus Notes make it easy for 
individuals to add information, but as the space grows, 
retrieval becomes more and more difficult. 
Semistructured information sharing systems, 
including Lotus Notes™ and a variety of meta-tagging 
schemes being developed for the World Wide Web 
(e.g. Apple's Meta Content Framework [Guh97]), 
address part of this problem by providing the ability to 
structure local parts of the information space. In a 
semistructured information space, documents are 
sectioned into weakly-typed fields according to user 
specifications, and documents with the same field 
structure can be grouped into collections. Within a 
collection, field values can be used as indexes for 
easier retrieval. 

Unfortunately, semistructuring document collections 
does not solve the problem of retrieving information 
across a large information space. Even if individual 
collections are well designed for retrieval, users can be 
overloaded with the sheer number of collections. 
Retrieval across the entire space is difficult because it 
is highly heterogeneous. Each collection has its own 
local schema, and there are no enforced keys or 
formats for data items within or across collections. 

Our work addresses the problem of finding and 
integrating useful information across collections in 
large semistructured information spaces. Our goal is 
to provide querying that is more powerful and precise 
than full-text search, but without requiring the 
collections to be strongly typed, data normalized, and 
fully mapped to a global schema, as methods like 
multidatabase SQL require. In this paper, we focus on 
the retrieval of integrated summaries of useful 
information (entity profiles), drawn from multiple, 
heterogeneous document collections. 

Our approach is to provide high quality retrieval of 
information related to important entities in the 
information space. In our organization (a large 
professional services firm), important types of entities 
include people, companies, and consulting skills. A 
review of our largest collections revealed that nearly 
always, references to important entities are fielded 
rather than buried in free-running text. Because the 
same entity can be referred to in many different ways 
across a heterogeneous information space, our entity 
retrieval system normalizes references to entities in a 
heuristic, type-dependent manner. For instance, the 
person names "Mr. Bob Smith", "Smith, Robert", and 
"R. J. Smith" are normalized such that a query for any 
one (or a number of other possible forms) will retrieve 
documents containing any of them. 

PW Notes Explorer 

Profile for "HP" (company) 

SIC Code: 3570 - Computer & Office Equipment 
Net income: $2,433,000,000 
Total assets: $24,427,000,000 
Net revenues: $31,519,000,000 
SEC Filings: 
WWW Home Page: 
Client Of: 
    Audrey Auditor 
    Tom Taxman 
    Courtney Consultant 
Vendor Relationship Coordinator: Vince Vendrel 
Analysts' mentions: 
    07/21/97 Knowledge Info Transfer Hewlett-Packard Co.: Managing Diversity: Heterogeneous Client Server Networks Pose New 
    07/17/97 Knowledge Info Transfer Hewlett-Packard Co.: Finance Function Best Practices: The Hallmarks of a World-Class Finance 
    07/17/97 Knowledge Info Transfer HP: Corporate Tax Department Survey 
    07/17/97 Knowledge Info Transfer Hewlett-Packard Company: PeopleSoft Global Alliance Partners 

Figure 1: Results of NX Profile Search 


We have implemented an entity-based retrieval 
system called NX (for Notes Explorer) that operates 
over a large semistructured information space. The 
space currently includes over one hundred corporate 
Lotus Notes collections and a small set of web 
collections, together containing about 300,000 
documents. NX provides full-text search, entity-based 
document retrieval for people, companies, and skills, 
and profile extraction for people and companies. It is 
delivered over an intranet using HTML. 

A key hypothesis behind this work is that a 
relatively small amount of meta-information - much 
less than that required to normalize and map 
collections to a comprehensive global schema — can 
give a large gain in query power and precision over 
knowledge-free methods like full-text search. NX is 
one illustration of this hypothesis. It requires only a 
modest amount of meta-information about each 
collection - an indication of fields containing entities 
in various semantic categories and pairs of fields that 
stand in specific semantic relations - and uses it to 
produce a dramatic improvement in retrieval quality 
for entity-related queries. Much of the required meta- 
information can actually be inferred automatically 
based on field names and data within the collections, 
using a simple heuristic classifier. 

In what follows, we first motivate the task of 
generating entity profiles with a real-world example. 
Next, we describe the main components of our 
retrieval system. We conclude by discussing related 
and future work. 

2 Entity-based retrieval 

In a corporate setting, information in different 
documents is frequently linked through references to 
entities with business importance, such as people and 
companies. Often, users search for information about 
particular entities (e.g., "What is Bob Smith's phone 
number?" or "Who's the manager for the XYZ Co. 
account?") as opposed to ungrounded, aggregate 
queries across sets of entities (e.g. "Show me all 
managers with more than five clients over $5 million 
in sales"). We designed NX to support this type of 
search. 

Consider a typical example from our organization. 
A staff member is writing a proposal to XYZ 
Company for some consulting work. She needs 
answers to questions like: 

(a) How large is XYZ Company? E.g., what are their 
assets, revenues, etc.? 
(b) Does our organization have a prior relationship 
with XYZ? Have we done other consulting work 
for them in the past? 

(c) If so, who did that work, and how can they be 
contacted? 

Each question refers to entities of various types - 
XYZ Company, staff members, phone numbers, etc. - 
and these entities may be referred to differently in 
different documents. Some questions involve 
information that may be found in many collections of 
the same type - e.g., information about prior work for 
XYZ (b) might be found in numerous collections 
containing client engagements. Others involve 
linking information about XYZ with information 
about another entity — e.g., question (c) requires 
finding staff names in documents that list XYZ 
engagements, and then finding contact information for 
those staff names. 

Figure 1 displays the results of a profile search in 
NX given "HP" as a company name search string. 
Normalization allows NX to retrieve information from 
documents that mention "Hewlett Packard", "Hewlett- 
Packard, Inc.", etc., as well as "HP". The headings 
(e.g., "SIC Code:" and "Client Of:") list specific 
values that have the specified relationship to the 
company of interest. These values are drawn from 
multiple documents in different collections; the square 
document icons are hyperlinks to the source 
documents. In the case of values representing people 
and companies, the value (e.g., "Audrey Auditor") is 
also displayed with a hyperlink that initiates a profile 
search on that value. This allows, for example, 
contact information to be found for people who have 
"HP" as a client. Other headings (e.g., "SEC Filings:" 
and "Analysts' mentions:") are followed only by 
document links, as it is the document as a whole that 
is of interest — not specific information extracted from 
it. 

3 Extracting Entity Profiles in NX 

This section describes the major components of NX 
that are used to support its profile search capability: 

• Semi-automatic field classification 
• Entity normalization 
• Definition of a partial global schema 
• Extraction of profile information from entity indexes 
• Detection and resolution of profile ambiguity 



1 Actual people's names have been replaced in the HTML generated by 
Notes Explorer to preserve privacy. 



3.1 Semi-automatic field classification 

To build an index of entity references of different 
types, we must identify where those types occur within 
collections. NX's field classifier uses field names and 
sample values from a collection to classify fields as 
containing entity types (people's names, company 
names, phone numbers, dollar amounts, etc.) and 
identifiable semantic roles that they play within the 
collection, e.g., partner on an engagement, client 
company, or vendor company. The current version 
recognizes person names, company names, telephone 
numbers, geographic locations, office names, and 
dollar amounts. As classification is not 100% 
accurate or complete, a Web browser interface is 
provided to alter the entity and role types for each 
collection's fields. 
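
As a rough illustration of the kind of heuristic classifier described above, the following sketch scores a field using both its name and a few sample values. The hint words, regular expressions, and scoring threshold are all assumptions made for this example; the paper does not publish NX's actual rules, and only four of the entity types are covered here.

```python
import re

# Hypothetical hint words and value patterns (not taken from NX itself).
NAME_HINTS = {
    "person": ("name", "staff", "partner", "ptnr", "author"),
    "company": ("company", "client", "vendor"),
    "phone": ("phone", "tel", "fax", "line"),
    "dollar_amount": ("revenue", "income", "assets", "fee"),
}
VALUE_PATTERNS = {
    "phone": re.compile(r"^\(?\d{3}\)?[-. ]\d{3}[-. ]\d{4}$"),
    "dollar_amount": re.compile(r"^\$\d[\d,]*(\.\d+)?$"),
    "company": re.compile(r"\b(Inc|Co|Corp|Ltd|Company)\b\.?", re.I),
    "person": re.compile(r"^([A-Z][a-z]+|[A-Z]\.)( ([A-Z][a-z]+|[A-Z]\.))+$"),
}


def classify_field(field_name, sample_values):
    """Return the best-scoring entity type for a field, or None if unsure."""
    scores = {}
    lowered = field_name.lower()
    # Evidence from the field name: one point if any hint word occurs in it.
    for etype, hints in NAME_HINTS.items():
        if any(h in lowered for h in hints):
            scores[etype] = scores.get(etype, 0.0) + 1.0
    # Evidence from the data: fraction of sample values matching the pattern.
    for etype, pattern in VALUE_PATTERNS.items():
        if sample_values:
            hits = sum(1 for v in sample_values if pattern.search(v.strip()))
            scores[etype] = scores.get(etype, 0.0) + hits / len(sample_values)
    if not scores:
        return None
    best = max(scores, key=scores.get)
    # Require at least one full point of evidence before committing.
    return best if scores[best] >= 1.0 else None


phone_type = classify_field("OfficePhone", ["415-555-1212", "604-555-2252"])
person_type = classify_field("StaffName", ["Bob Nado", "Sue Jones"])
unknown_type = classify_field("Notes", ["misc text"])
```

A field that stays unclassified (here, "Notes") is the sort of case the Web browser interface would let an administrator correct by hand.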

3.2 Entity Normalization 

In a standard relational database, tuples from different 
tables that contain information about the same entity 
each contain a key for that entity allowing the tables to 
be joined. In a semistructured document space, 
however, there are rarely unique keys shared by 
collections. Rather, entities are referred to within text 
strings in a variety of formats, with a variety of 
synonyms and abbreviations. 

Therefore, to allow search over entities, entity 
references must be normalized and matched (as in 
[HS95]). For maximum retrieval speed, NX 
normalizes entity references at indexing time. The 
normalization is heuristic, using formatting 
knowledge and synonym tables specific to each entity 
type. NX's entity index stores both the original form 
and a normalized form of each entity reference. At 
retrieval time, a normalized form of the user's search 
string is created and used to retrieve matches from the 
normalized entity index. In some cases, values are 
only partially normalized, and the original forms of 
retrieved matches and the search string are compared 
to verify the match. 
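
The index-time normalization and retrieval-time verification described above might be sketched as follows for person names. The normalization key (last name plus first initial), the small nickname table, and the class and function names are assumptions for illustration only; NX's real rules are described in [HB97].

```python
from collections import defaultdict

TITLES = {"mr", "mrs", "ms", "dr"}
SUFFIXES = {"jr", "sr", "ii", "iii"}
NICKNAMES = {"bob": "robert", "rob": "robert", "sue": "susan"}  # illustrative only


def name_parts(raw):
    """Split a person name into (first, last), lowercased, without titles/suffixes."""
    if "," in raw:                        # "Smith, Robert" -> "Robert Smith"
        last, first = raw.split(",", 1)
        raw = f"{first} {last}"
    tokens = [t.strip(".").lower() for t in raw.split()]
    tokens = [t for t in tokens if t and t not in TITLES and t not in SUFFIXES]
    first = NICKNAMES.get(tokens[0], tokens[0])
    return first, tokens[-1]


def normalize(raw):
    """Partial normalization: last name plus first initial."""
    first, last = name_parts(raw)
    return f"{last}/{first[0]}"


def compatible(a, b):
    """Verify a candidate match by comparing the fuller original forms."""
    fa, fb = name_parts(a)[0], name_parts(b)[0]
    return fa == fb or len(fa) == 1 or len(fb) == 1


class EntityIndex:
    """Stores both the original and the normalized form of each reference."""

    def __init__(self):
        self._by_key = defaultdict(list)  # normalized key -> [(original, doc_id)]

    def add(self, original, doc_id):
        self._by_key[normalize(original)].append((original, doc_id))

    def search(self, query):
        hits = self._by_key.get(normalize(query), [])
        return [(o, d) for o, d in hits if compatible(query, o)]


idx = EntityIndex()
for doc_id, name in [("d1", "Mr. Bob Smith"), ("d2", "Smith, Robert"),
                     ("d3", "R. J. Smith"), ("d4", "Richard Smith")]:
    idx.add(name, doc_id)
matched_docs = [d for _, d in idx.search("Bob Smith")]
```

In this sketch all four names share the partially normalized key, and the comparison of original forms is what excludes "Richard Smith" from a search for "Bob Smith".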

In addition, pre-processing is required to find the 
portions of the input string containing entity 
references. Often, a field will contain multiple entity 
values in a single string, with spurious information 
interspersed. For example, a typical person name 
field value might be "Bob J. Smith Jr. - managing 
partner; Sue Jones, 415-555-1212, Palo Alto." NX's 
normalization routines extract "Bob J. Smith Jr." and 
"Sue Jones" out of this field value. 

NX's field classification and normalization routines 
are described in more detail in [HB97]. 
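
The pre-processing step described in this section (pulling person names out of a field value that mixes several entities with other text) can be approximated with a simple pattern. This is only a sketch under the assumption that entries are separated by semicolons and that each entry begins with the name; the token pattern and function name are hypothetical, and a name followed directly by other capitalized words would be over-matched.

```python
import re

# A name token: a known suffix, an initial, or a capitalized word.
# Suffixes come first so that "Jr." keeps its trailing period.
TOKEN = r"(?:Jr\.|Sr\.|[A-Z]\.|[A-Z][a-z]+)"
NAME_RE = re.compile(rf"^\s*({TOKEN}(?:\s+{TOKEN})+)")


def extract_person_names(field_value):
    """Return the leading person name of each ';'-separated entry."""
    names = []
    for segment in field_value.split(";"):
        match = NAME_RE.match(segment)
        if match:
            names.append(match.group(1))
    return names


raw = "Bob J. Smith Jr. - managing partner; Sue Jones, 415-555-1212, Palo Alto."
names = extract_person_names(raw)
```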
Relation: WorkPhone 

Mappings: 
    Collection: PW Address Book 
        WorkPhone( StaffName, OfficePhone ) 
    Collection: US Partner Directory 
        WorkPhone( PtnrNm, DirectLineNum ) 

PW Address Book 
    StaffName: Bob Nado 
    OfficePhone: 415-555-[illegible] 
    Address: 

US Partner Directory 
    PtnrNm: Bob Smith 
    DirectLineNum: 604-555-2252 

Figure 2: Mapping a Predicate to Collection Fields 



3.3 Definition of a Partial Global 
Schema 

The profile search capability of NX is based on a 
global vocabulary for describing the types of 
information that may be found about an entity in the 
different information sources that are available. No 
attempt is made to define a complete global schema 
characterizing all of the relations that might be 
extracted from individual collections. Rather, the 
global schema used by NX is partial - containing only 
enough meta-information to support the desired entity 
profiles. Currently, the global vocabulary includes 
sorted, binary predicates of two types. An entity 
predicate represents a relationship between two 
entities. For example, "Work Phone" is an entity 
predicate representing the relationship between a 
person and a phone number where that person may be 
reached at work. Sorts (entity types) are assigned to 
the domain and range arguments of the "Work Phone" 
predicate - "Person" and "Phone Number" - 
restricting the applicability of the predicate. The other 
type of predicate, called a document predicate, 
represents a relationship between an entity and a 
document that is "about" that entity. For example, 
"Resume" is a document predicate relating a "Person" 
and a resume document. 

In addition to declaring domain and range sorts, 
each predicate must be mapped to the relevant fields 
in collections that locally instantiate the predicate. 
For example, in the PW Address Book collection, the 
"Work Phone" predicate is mapped to a domain field 
called "StaffName" and a range field called 
"OfficePhone". Other collections may also have 
information relevant to the "Work Phone" predicate 
but use different fields to record the person name and 
the phone number (see Figure 2). An entity predicate 
may be mapped to multiple pairs of domain and range 
fields in a single collection. Document predicates 
have a simpler mapping, requiring only a domain field 
in each relevant collection. 

Currently, the mapping of predicates to fields in 
collections is performed manually using a Web 
browser interface. The interface narrows the set of 
candidate collections and fields for each predicate by 
exploiting the entity types assigned to fields by NX's 
field classifier. A collection can be ignored when 
mapping a predicate if it does not contain fields with 
entity types matching both the domain and range sorts 
of the predicate. Given an eligible collection, 
candidates for the domain and range fields are 
narrowed to those whose entity types match the 
domain and range sorts of the predicate. The interface 
allows the predicate mapping process to be performed 
in a small amount of time, typically less than a half 
hour per collection. 
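
One way to picture the partial global schema and the candidate narrowing performed by the mapping interface is the sketch below. The `Predicate` class, the `candidate_pairs` function, and the "Location" type assigned to the Address field are assumptions for illustration; the collection and field names come from Figure 2.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Predicate:
    name: str
    domain_sort: str
    range_sort: Optional[str] = None  # None marks a document predicate
    # collection name -> list of (domain_field, range_field) pairs
    mappings: Dict[str, List[Tuple[str, Optional[str]]]] = field(default_factory=dict)


def candidate_pairs(pred, field_types):
    """Narrow a collection's fields to those whose entity types fit the predicate.

    field_types maps field name -> entity type, as assigned by the classifier.
    An empty result means the collection can be ignored for this predicate.
    """
    domains = [f for f, t in field_types.items() if t == pred.domain_sort]
    if pred.range_sort is None:
        return [(d, None) for d in domains]
    ranges = [f for f, t in field_types.items() if t == pred.range_sort]
    return [(d, r) for d in domains for r in ranges if d != r]


work_phone = Predicate("Work Phone", "Person", "Phone Number")
resume = Predicate("Resume", "Person")
address_book = {"StaffName": "Person", "OfficePhone": "Phone Number",
                "Address": "Location"}

# The administrator would confirm one of the offered candidates by hand.
offered = candidate_pairs(work_phone, address_book)
work_phone.mappings["PW Address Book"] = offered[:1]
```

The final choice stays manual in this sketch, mirroring the paper: the classifier's types only shrink the list of fields an administrator has to consider.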

3.4 Extraction of profile information 
from entity indexes 

A profile for a particular category of entity is defined 
by listing the global predicates that should make up 
the profile in the order in which they should be 
displayed in the results page of a profile search. 
Information can be associated with individual 
predicates through a Web browser interface to control 
the formatting, number, and sorting of profile results 
displayed for the predicates. 
1. Retrieve matching records from the entity index 
2. Filter out records with incorrect domain field 
3. Retrieve records from the same document 
4. Filter out records with incorrect range field 
5. Sort records by predicate and generate HTML 

Figure 3: Profile Information Extraction 

Retrieving an entity profile involves five steps, as 
depicted in Figure 3. First, NX retrieves records from 
the entity index whose normalized field values match 
the normalized form of the search string and that 
have the correct entity type. Next, NX filters out 
records retrieved in Step 1 whose field is not a domain 
field for any predicate in the profile. For each record A 
resulting from Step 2, NX retrieves records from the 
entity index that originate in the same document. In 
Step 4, NX filters out records retrieved for A in Step 3 
whose field is not a range field corresponding to A's 
field as a domain field for one of the profile 
predicates. Finally, NX sorts the remaining records by 
profile predicate and generates an HTML page 
displaying the results for each predicate. 

Because they have been normalized, the results 
found for a particular profile predicate can be properly 
grouped independently of how they were referred to in 
the source documents. In essence, this uses the 
normalized entity index as a simple data warehouse, 
enabling an aggregation over entities in document 
sets. 
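
The five steps of Figure 3 can be sketched over an in-memory list of index records. The `Record` layout, the shape of the `predicates` argument, the "Engagement Archive" collection, and the sample phone numbers are assumptions for this example; HTML generation in Step 5 is left out.

```python
from collections import defaultdict, namedtuple

Record = namedtuple("Record", "doc_id collection field entity_type normalized original")


def extract_profile(index, predicates, query_norm, entity_type):
    """predicates: {predicate name: {collection: [(domain_field, range_field)]}}."""
    # Step 1: records matching the normalized search string and entity type.
    matches = [r for r in index
               if r.normalized == query_norm and r.entity_type == entity_type]
    results = defaultdict(list)
    for a in matches:
        # Step 2: keep A only if its field is a domain field of some predicate.
        wanted = [(name, rng) for name, colls in predicates.items()
                  for dom, rng in colls.get(a.collection, []) if dom == a.field]
        if not wanted:
            continue
        # Step 3: other index records originating in the same document.
        same_doc = [r for r in index
                    if r.doc_id == a.doc_id and r.collection == a.collection
                    and r is not a]
        # Step 4: keep those in the range field paired with A's domain field.
        for name, rng in wanted:
            for r in same_doc:
                if r.field == rng:
                    results[name].append((r.original, r.doc_id))
    # Step 5: order results by predicate name.
    return dict(sorted(results.items()))


index = [
    Record("d1", "PW Address Book", "StaffName", "Person", "smith/r", "Bob Smith"),
    Record("d1", "PW Address Book", "OfficePhone", "Phone Number", "4155551212", "415-555-1212"),
    Record("d2", "US Partner Directory", "PtnrNm", "Person", "smith/r", "Smith, Robert"),
    Record("d2", "US Partner Directory", "DirectLineNum", "Phone Number", "6045552252", "604-555-2252"),
    Record("d3", "Engagement Archive", "Author", "Person", "smith/r", "R. Smith"),
    Record("d3", "Engagement Archive", "Phone", "Phone Number", "4155550000", "415-555-0000"),
]
predicates = {"Work Phone": {
    "PW Address Book": [("StaffName", "OfficePhone")],
    "US Partner Directory": [("PtnrNm", "DirectLineNum")],
}}
profile = extract_profile(index, predicates, "smith/r", "Person")
```

Document d3 matches the search string but contributes nothing, because its "Author" field is not a domain field of any predicate in the profile.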

3.5 Profile ambiguity 

Given a particular search string, NX may find 
references in documents to more than one distinct 
entity, each of whose names matches the search string. 
For example, when "Bob Smith" is supplied as the 
search string, matches may be found that give 
information about both Robert A. Smith and Robert S. 
Smith. This problem of ambiguous profile searches 
can only be addressed heuristically, as entities do not 
have unique keys across collections. 

In some cases, however, reference lists of entities 
can be used to aid in disambiguation. For person 
names within our firm, for instance, there are "address 
books" mapping each person's name to a unique email 
address. Generally, a PW staff member will have 
exactly one entry in one of the address books that exist 
for each PW firm around the world. If more than one 
match is found for the search string in the collection 
of address books, the user is asked to select one of the 
address book entries in order to refine the search. 
This is illustrated in Figure 4 for a profile search on 
"Bob Smith". 

The selected address book entry often gives a more 
specific search string with which to continue the 
search. In addition, the address book entry may give 
other information about the chosen person (such as 
work office) that may be used to filter out documents 
that contain conflicting information. 
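
A minimal sketch of this disambiguation flow follows, assuming address book entries are plain dictionaries carrying a normalized key and an office, and that the user's choice is supplied through a callback. The function names and record layout are hypothetical; the three entries mirror those shown in Figure 4.

```python
def resolve_person(query_key, address_entries, choose):
    """Return the single matching address book entry, asking the user if needed."""
    candidates = [e for e in address_entries if e["key"] == query_key]
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    return choose(candidates)  # more than one match: let the user pick


def drop_conflicts(records, person):
    """Discard documents whose stated office conflicts with the chosen entry."""
    return [r for r in records if r.get("office") in (None, person["office"])]


entries = [
    {"key": "smith/r", "name": "Robert Smith", "office": "Hobart"},
    {"key": "smith/r", "name": "Rob Smith", "office": "Southampton"},
    {"key": "smith/r", "name": "Robert S. Smith", "office": "Austin"},
]
chosen = resolve_person("smith/r", entries, choose=lambda cands: cands[2])
docs = [{"doc": "d1", "office": "Austin"},
        {"doc": "d2", "office": "Hobart"},
        {"doc": "d3"}]
kept = [r["doc"] for r in drop_conflicts(docs, chosen)]
```

Documents that say nothing about the office are kept in this sketch, since only conflicting information is grounds for filtering.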



4 Discussion and Future Work 

As described in [HB97], we have evaluated NX's 
entity-based retrieval through a comparison to 
standard full-text search, finding that it produces 
much more precise result sets than full-text search for 
important classes of queries. To date, we have not 
explicitly evaluated the entity profiling capability. It 
may be difficult to use traditional IR evaluation 
metrics like precision and recall over such a large and 
diverse information space. Rather, we plan to 
evaluate profiles' usefulness to end users, through user 
feedback and surveys. 

The goal of our work is to provide better information 
retrieval across a large semistructured space than full- 
text search, while avoiding excessive meta- 
information overhead. Our approach is based on 
observing that in an information space used by a 
particular organization, important entity types link 
information together and can be used as a central 
retrieval cue. This data-driven approach can be 
contrasted with schema-driven approaches used by 
multidatabase systems (e.g., [ACHK93]), and similar 
systems attempting to integrate structured world-wide 
web sources [LRO96, FDFP95]. In schema-driven 
approaches, each local schema is mapped to a central 
global schema, and mapping rules are used to 
translate between data formats used by different 
sources (e.g. [CHS91]). These approaches are 
appropriate for relatively small numbers of tables 
where the data within each table is well-specified; 
however, semistructured information spaces can 
include hundreds of sources, and data even within 
single sources can have multiple formats. A schema 
integration phase would be burdensome in such a 
large space [GMS94]. Instead, NX relies on heuristics 
to categorize fields into a small number of entity and 
role types, and normalizes entity values for retrieval. 
The resulting retrieval system makes it practical to 
encompass a greater number and variety of data 
sources than multidatabase systems, although the 
query language is less general because queries must 
refer to a specific entity. 

PW Notes Explorer 

Choose a person to profile: 

• Robert Smith: PW Hobart, Robert Smith (PW Australia Address Book) [Select] 
• Rob Smith: Rob Smith; UK, Southampton (PW Europe Address Book) [Select] 
• Robert S. Smith: Austin, Texas; Robert S. Smith (PW Name & Address Book) [Select] 

Figure 4: Ambiguous Profile Search 

Topics to be addressed by future work include: 

• extending profiles to other entity types such as 
service lines and skills, 

• customizing profiles to meet the requirements of 
particular classes of users, 

• using information about recency and reliability to 
resolve conflicts in information retrieved as part of a 
profile, e.g., multiple office phone numbers retrieved 
for a person 

• performing inference in the determination of 
profile results that combines information from several 
documents, e.g., determining a person's office 
telephone number from his assigned office and that 
office's main switchboard number 

• developing automated techniques for mapping 
global schema predicates to pairs of collection fields 
by exploiting abstract classifications of collections, 
e.g., "directory" collections are more likely to contain 
a person's phone number; client engagement archives 
are more likely to contain the names of a person's 
clients. 

• extending the information available as part of a 
profile by developing additional extraction and 
summarization methods, e.g., producing a summary of 
a person's key skills from resume documents 

5 Conclusion 

Semistructured systems are an intermediate point 
between unstructured collections of textual documents 
(e.g., untagged Web pages) and fully structured tuples 
of typed data (e.g., relational databases). Based on 
observing how information is typically retrieved and 
used within our organization, we have developed an 
entity-based retrieval system over a large 
semistructured information space. The system 
incorporates semi-automatic classification of fields, 
normalization of field values, and structured retrieval 
of commonly required information in the form of 
entity profiles. For typical queries containing entities, 
the system provides much more focused and 
normalized retrieval than full-text search. 
NoDoSE - A tool for Semi-Automatically Extracting Structured 
and Semistructured Data from Text Documents. 

Brad Adelberg* 
Abstract 

[...] described and experiences parsing a variety of documents are reported. 
Keywords: data extraction, semistructured data, structure mining, wrapper induction. 

1 Introduction 

The amount of useful semistructured data [Abi97] on the web continues to grow at a [...] 
[...] would like to gain conventional database system 
[...] querying and reporting. This has spurred a recent flurry of work [KWD97, AK97a, HGMC+97, ...] 
[...] around such sources, either manually or with software assistance, 
[...] the new data within the reach of general query tools. It is important to note, however, that 
[...] but in text files on local file systems. Examples are mail files, code and code documentation, 
[...] files, logs of program activity, phone lists, etc. Further, there is a huge collection of 
[...] tags. Thus, if we want to extract all of the data [...] 
through a query interface, we need to focus on something more general than HTML files: plain 
text files. Since HTML files are a special case of text files, a tool that handles text files will also 
[...] 

Extracting information from text files is harder than for HTML files for three reasons: 

1. Since text files do not generally contain markup tags, there are usually fewer structural clues 
and those that are present are not known a priori. 
Figure 1: User level architecture. 

input into a comma delimited file suitable for importation into a data analysis program or 
spreadsheet. 

2. If the extracted data is to be stored in a database, a schema file (if the data is structured) 
and a load file suitable for the load utility that comes with the DBMS can be generated. 

3. If the input documents are to be exposed through a query interface, the Lex and Yacc code 
needed by a wrapper can be generated. 

In this paper we describe the design and implementation of this extraction tool — NoDoSE, the 
Northwestern Document Structure Extractor. We cover both the NoDoSE architecture (in Section 
3), which can be used as a test bed for structure mining algorithms in general, and the algorithms 
for inferring parsing rules that we have developed (in Section 4). A first version of NoDoSE has 
been implemented and is described in Section 5 along with a description of our experiences using 
it to extract data from a variety of documents. 

2 Example 

To illustrate the use of NoDoSE we consider the problem of analyzing the results of simulation 
experiments. Although recent work on data extraction [KWD97, AK97a, HGMC+97] has focused 
on web pages, we choose simulation output here since it is meant to be human readable and not 
program readable. Thus it has fewer structural clues (such as tags) and is more difficult to determine 
parsing rules for. Also, we anticipate that a system capable of inferring parsing rules for plain text 
documents will be able to handle HTML documents as well. 

An example of human readable simulator output (which was generated by DeNet [Liv90]) is 
shown in Figure 2 (see footnote 1). Most tools for storing and analyzing data, such as database systems, 
spreadsheets, and plotting programs, cannot handle files this complicated. Instead, the output file must 
be converted into a more regular (i.e. tabular) file before it can be processed. Typically, this 
conversion is performed by a hand-coded program in awk, sed, perl, or some other scripting language. 
Using NoDoSE, the conversion can be performed quickly and without any coding expertise. 

There are three steps to the conversion process: 

1. Decide on how to model the data in the documents. 

2. Hierarchically decompose the files, mapping regions of the files into components of the chosen 
model. 

3. Specify how the extracted data is to be output. 
We describe each of the three steps in the sections below. 



1 The output as shown is slightly modified from the original. For illustration purposes, node results relating to 
confidence intervals were removed since they cannot be represented as a <average,stdev,num> triple. NoDoSE can 
parse the original files if a different data model is chosen. 



DeNet(V1.6) simulation started on Mon Sep 5 14:41:31 1994 

Attributes of node # 0 (Principal) 
    simTime - 5.000000E+02 
Attributes of node # 1 (usource) 
    arrivalRate - 2.000000E+02 
    meanSkewL - 1.000000E-01 
    meanSkewH - 1.000000E-01 
Attributes of node # 5 (tsink) 
    batchSize - 100 
    confidenceLevel - 9.500000E-01 
    confidenceInterval - 1.000000E+00 
Sample Results Node # 1 (usource) 
    1 numUpdates - (avg) 1.000000E+00 - (std) 0.000000E+00 - (num) 100431 
Sample Results Node # 2 (tsource) 
    2 numJobs - (avg) 1.000000E+00 - (std) 0.000000E+00 - (num) 2455 
Sample Results Node # 5 (tsink) 
    5 misDL - (avg) 9.295315E-01 - (std) 2.559869E-01 - (num) 2455 
    5 missedStale - (avg) 9.295315E-01 - (std) 2.559869E-01 - (num) 2455 

DeNet(V1.6) simulation terminated on Mon Sep 5 14:42:24 1994 
Total CPU usage 0.896 Minutes, (user 0.894 ; system 0.002 ) 



Figure 2: Simulation output example. 
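
For comparison, the kind of hand-coded conversion script that this section says is typically written might look like the sketch below, which turns the "Sample Results" lines of Figure 2 into comma delimited rows. It is written in Python rather than awk, sed, or perl; the regular expression and function name are assumptions, and the separator is accepted as either "-" (as it appears in the figure) or "=".

```python
import csv
import io
import re

# One result line: node number, measure name, then avg / std / num values.
RESULT_RE = re.compile(
    r"^\s*(\d+)\s+(\w+)\s+[-=]\s+\(avg\)\s+(\S+)"
    r"\s+[-=]\s+\(std\)\s+(\S+)\s+[-=]\s+\(num\)\s+(\d+)\s*$"
)


def results_to_csv(text):
    """Convert the result lines of a simulation log into CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["node", "name", "avg", "std", "num"])
    for line in text.splitlines():
        match = RESULT_RE.match(line)
        if match:  # header and attribute lines simply do not match
            node, name, avg, std, num = match.groups()
            writer.writerow([int(node), name, float(avg), float(std), int(num)])
    return out.getvalue()


SAMPLE = """Sample Results Node # 5 (tsink)
5 misDL - (avg) 9.295315E-01 - (std) 2.559869E-01 - (num) 2455
5 missedStale - (avg) 9.295315E-01 - (std) 2.559869E-01 - (num) 2455
"""
rows = list(csv.reader(io.StringIO(results_to_csv(SAMPLE))))
```

A script like this is tied to one output layout, which is the maintenance burden the paper argues NoDoSE removes.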
2.1 Modeling the documents 

Before extracting data from the documents the user must decide how to model the data. One 
possibility for the data, and the one that will be used in this example, is shown in Figure 3. 
Documents are of type SimulationRun and contain three top-level components: a timestamp, a list 
of input parameters for each simulation node, and a list of measured results for each node. The 
parameters for each node (part of NodeParams) are represented as a list of <name,value> pairs. 
This has the benefit of a very regular structure but the disadvantage that the type information 
about parameters is being lost since every value is modeled as a string. The structure is so regular, 
in fact, that the output of any simulator written in DeNet can be modeled using this schema. 

Instead of representing all of the different simulator nodes in a generic way, we could also create 
a new class for each. This solution has the benefit that we can model the particular parameters of 
each node and their real types. It also has two problems: any change to the simulator will require 
a change to the model, and the grammar derived for the output of one simulator cannot be used 
on the output of another DeNet simulator. Thus because it is more general and because it is more 
difficult to parse, we will use the model from Figure 3 in this example. NoDoSE, however, can work 
with either. 






interface NodeParams {
    attribute int node_number;
    attribute string node_name;
    attribute List<Struct OneParam {string name, string value}> parameters;
}

interface NodeResults {
    attribute int node_number;
    attribute string node_name;
    attribute List<Struct OneResult {string name, real average, real std, int num}> results;
}

interface SimulationRun {
    attribute string timestamp;
    attribute List<NodeParams> node_params;
    attribute List<NodeResults> node_results;
}

Figure 3: Example schema for the simulation output. 
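
As a rough illustration of the generic model in Figure 3, the schema can be mirrored with Python dataclasses. This is only a sketch: the class names follow the figure, but the regular expression and the parse_result_line helper are assumptions made for this example and are not part of NoDoSE, which infers such rules by mining rather than from hand-written patterns.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class OneParam:
    name: str
    value: str  # every parameter value is kept as a string in this model


@dataclass
class OneResult:
    name: str
    average: float
    std: float
    num: int


@dataclass
class NodeParams:
    node_number: int
    node_name: str
    parameters: List[OneParam] = field(default_factory=list)


@dataclass
class NodeResults:
    node_number: int
    node_name: str
    results: List[OneResult] = field(default_factory=list)


@dataclass
class SimulationRun:
    timestamp: str
    node_params: List[NodeParams] = field(default_factory=list)
    node_results: List[NodeResults] = field(default_factory=list)


# Hypothetical pattern for one result line of Figure 2, for example
# "5 misDL - (avg) 9.295315E-01 - (std) 2.559869E-01 - (num) 2455"
RESULT_LINE = re.compile(
    r"^\s*\d+\s+(\S+)\s+-\s+\(avg\)\s+(\S+)\s+-\s+\(std\)\s+(\S+)"
    r"\s+-\s+\(num\)\s+(\d+)\s*$"
)


def parse_result_line(line: str) -> OneResult:
    """Turn one result line into a OneResult; raise ValueError on mismatch."""
    match = RESULT_LINE.match(line)
    if match is None:
        raise ValueError(f"not a result line: {line!r}")
    name, avg, std, num = match.groups()
    return OneResult(name, float(avg), float(std), int(num))
```

Because parameters are stored as <name,value> string pairs, the same classes would serve any DeNet simulator; only the results carry real numeric types.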



2.2 Decomposing the documents 

The decomposition process begins by loading a single document into NoDoSE. The user then 
hierarchically decomposes the document using a GUI. Next, additional documents of the same type
are loaded into the system and automatically parsed. Any errors are corrected by using the GUI
and reparsing. The process is complete when all of the documents have been successfully parsed. 

The first step in decomposing a document is indicating its top level structure, in this case a 
record of type SimulationRun. Next, we add each of its three fields (timestamp, node_params, and
node_results) by selecting the relevant portion of the text in the document window and clicking on
the add structure button in the tool bar (Figure 4). The type, type name, and label of each field
can be entered using the controls on the bottom portion of the window. Since the node_params and
node_results fields are complex types (lists), the decomposition process must continue.

Suppose the user chooses to decompose the list of node results next. Double-clicking on that 
node in the tree view panel will display only the portion of the document mapped to the node_results
list. The user then selects the text of the first element of the list (the first two lines) and adds 
this as a structure. Next, the second element of the list is added. Figure 4 shows a snapshot of 
the interface at this point. The user could continue to add every element in this manner but this 
would become tedious if there are many elements. Instead, the user can ask NoDoSE to try to infer 
the remaining elements by mining the text. If the tool mistakenly identifies elements the user can 
correct a few of the errors and ask that the text be remined. In this way, the correct grammar for 
the component will eventually be learned. Once it is, NoDoSE is able to identify all of the other 
elements of the list correctly. 

The decomposition process must continue since each element of the node results list is a record of 
type NodeResults. Any of the list elements can be selected and its fields added. Figure 5 shows the 
screen after the third element has been decomposed. As before, the user does not have to decompose 
every element by hand. Once a few elements (in this case, one) have been decomposed, the miner 
can again be invoked to decompose every record of type NodeResults. Since the results field
of NodeResults is itself a list, the process must continue (node_params must also be decomposed).
The process continues until all of the leaves of the document tree are atomic types. 

After the grammar for a particular file has been determined, NoDoSE is loaded with all of the 
other files of the same type. These are automatically parsed using the grammar inferred from the 
first file. It's possible, though, that parsing fails on one of the additional files for one of two reasons. 
First, the additional file may contain an error such as a mistyped field name or an OCR error. In 
such a case, the user can correct the error through the GUI but the grammar does not need to be 
updated. The second reason why parsing may fail is that the additional files contain something that 
was not present in the first file parsed. For example, suppose that the files come from two different 
versions of the simulator and that the newer version measures and outputs an additional value. If 
a file from the old version was used for the initial parsing process, NoDoSE will fail to recognize 
the new field. In this case, the user must correct the parsed tree for the new file, describing the 
new field using the GUI as before. The extractor will then update its grammar to account for the 
new field. After this, any of the files with the newly described field will be automatically reparsed. 
When all of the files of the same document type have been successfully parsed, the first step of the 
conversion process is complete. 

2.3 Outputting the extracted data 

The final step of the conversion process is to specify how to output the data that has been extracted 
from the parsed files. As shown in Figure 1, different options are supported. One option is to write 
the data into a text file. The format of the file and which data to be output is specified using a 
simple GUI-based report generator. The intent of this component is not to replace the querying 
and reporting functions of a DBMS but to provide a quick means of writing simple files, such as 
comma or tab delimited tabular data for input to spreadsheets and the like. For users who need to 
perform more complex operations on the data, NoDoSE can generate a schema file and a load file 
for use by a load utility provided by a third party DBMS. At the present, the generated schema 
file is ODL-like and the load file is in a generic format of our design. Additional formats can be 
added either by using the report generator or by coding an additional report component. 
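
To make the simplest output option concrete, the sketch below writes extracted results as comma or tab delimited text suitable for a spreadsheet. It is an assumption-laden illustration, not the NoDoSE report generator: the write_report function, the row layout, and the column names are all hypothetical.

```python
import csv
import io
from typing import Iterable, Tuple

# (node_number, name, average, std, num): a flattened view of the results
ResultRow = Tuple[int, str, float, float, int]


def write_report(rows: Iterable[ResultRow], delimiter: str = ",") -> str:
    """Render extracted results as delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["node", "name", "average", "std", "num"])
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()
```

Passing delimiter="\t" yields the tab delimited variant of the same report.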

3 System Architecture 

This section describes the internal architecture of NoDoSE in two parts: it first describes how the 
structure of documents is represented and then gives an overview of the components that comprise 
the system. 

3.1 Document Model 

Figure 4: Screen shot of NoDoSE after a few steps.

Figure 5: Screen shot of NoDoSE later in the process.

Externally, documents are represented as flat files which serve as the input to NoDoSE. Internally,
however, we need to be able to store information about the structure of the documents. Hence for
every file that is loaded by the user, NoDoSE maintains a tree that maps the structural elements
of the document to the text of the file (Figure 6(a)). Each node of the tree represents one of the
structural components of the document such as an element of a list or a field in a record. The
following values are stored in each node:

typeName - Every node in the tree, and thus every component of a document, must be of either 
an atomic type or a named composite type. Details of the type system supported by NoDoSE 
are described below. 

startOffset, endOffset - These two values indicate which portion of the file corresponds to the 
structural component. For non-root nodes, the offsets are relative to the start of the parent 
node's region. 

label - The only required use of this field is in the children of record nodes to indicate which field 
the node represents. Labels can also be used to represent data in a schema-less model such 
as OEM [CGMH+97].

authorId - This identifies the creator of the node, the user or one of the mining components.
Maintaining the originator of a node is useful when mining structure since user identified
regions can usually be given greater credence than regions identified by the mining components.

confidenceValue - This is a value between 0 and 1 indicating how confident the author is that
this node is correct. It is typically set to 1, meaning complete confidence, for nodes added by 
the user and is set to a lower value for nodes inferred by one of the mining components. One 
practical use of the confidence value is to alert the user about nodes that may not have been 
parsed correctly (see Section 4.1.1). 

To clarify the mapping process, consider the file shown in Figure 6(a) that is composed of just 
a single line. We can view this file as a list of names, each name being composed of a first and last 
name. This structure would be represented by the tree shown in Figure 6(b). The root of the tree 
represents the whole document and is mapped to the entire file. The root is of type Doc which is a 
list of objects of type Name. The root has three children, each corresponding to one of the names 
in the list, and in the same order as the corresponding names appear in the document. Each child 
is of type Name, which is a structure with fields for the first and last name. Of course each node 
has a different pair of offset values to indicate where in the text the element is. 

Finally, each node of type Name has two children corresponding to the two fields in the record. 
Each child is of type String which is atomic and hence they have no children themselves. Also, 
each child node has a label that identifies which field of the parent node it contains the value for. 
Note that all of the firstName nodes have a start offset of 0. This is because offsets are relative to
the parent node's text (and, in this case, the first name value begins every list element). 
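
The relative-offset convention can be sketched in a few lines of Python. The Node class and the absolute_span method below are hypothetical names chosen for this illustration; only the stored values (type name, offsets, label) and the numbers used in the example come from the description above and Figure 6(b).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    type_name: str
    start_offset: int          # relative to the start of the parent's region
    end_offset: int            # relative to the start of the parent's region
    label: Optional[str] = None
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list)

    def add_child(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child

    def absolute_span(self) -> Tuple[int, int]:
        """Add the start offsets of all ancestors to get file positions."""
        base = 0
        ancestor = self.parent
        while ancestor is not None:
            base += ancestor.start_offset
            ancestor = ancestor.parent
        return base + self.start_offset, base + self.end_offset


# The second Name node of Figure 6(b) and its two fields.
doc = Node("Doc", 0, 40)
name2 = doc.add_child(Node("Name", 17, 28))
first2 = name2.add_child(Node("String", 0, 4, label="first"))
last2 = name2.add_child(Node("String", 5, 11, label="last"))
```

With these values, first2.absolute_span() is (17, 21) and last2.absolute_span() is (22, 28), even though both fields store small offsets relative to their parent.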

Type    Definition
Doc     List<Name>
Name    Record{String first, String last}

typeName = Doc, startOffset = 0, endOffset = 40
    typeName = Name, startOffset = 6, endOffset = 16
        typeName = String, startOffset = 0, endOffset = 4, label = first
        typeName = String, startOffset = 5, endOffset = 10, label = last
    typeName = Name, startOffset = 17, endOffset = 28
        typeName = String, startOffset = 0, endOffset = 4, label = first
        typeName = String, startOffset = 5, endOffset = 11, label = last
    typeName = Name, startOffset = 29, endOffset = 40
        typeName = String, startOffset = 0, endOffset = 5, label = first
        typeName = String, startOffset = 6, endOffset = 11, label = last

(b) Document tree for example file.

Figure 6: Representation of a document.

Every node in a document tree must be associated with a particular type. NoDoSE predefines
six atomic types: Integer, Float, String, Date, EmailAddress, and URL. Additional atomic types
can also be added; the user need only supply a name. Complex types can be defined as well
using the following common type constructors:

1. NewType = Set<OldType>,

2. NewType = Bag<OldType>,

3. NewType = List<OldType>,

4. NewType = Record{OldType1 fieldName1, OldType2 fieldName2, ...}.

Note that unlike some type systems, only singly nested types can be defined in a single step. Thus
defining the new type List<Record{String first, String last}> would require two new types, one
for the record and one for the list.

In addition to the structured type constructors, NoDoSE provides an analogous set of type 
constructors for semistructured data: SemiSet<>, SemiBag<>, SemiList<>, and
SemiRecord{fieldName1, fieldName2, ...}. These constructors do not restrict the type of their components so,
for example, all of the elements of the list do not have to be of the same type. Note that many of
the previously proposed models for semistructured data do not require all four constructors. For
instance, OEM [CGMH+97] objects can be represented using only atomic types and SemiList. We
have added them, though, for cases where more semantic information is known. 

Having covered the type system we can now define what constitutes a legal document tree. For 
a tree to be legal all of its nodes must be legal. A node n is legal if and only if all of the following 
conditions hold, the first four of which are related to type restrictions and the last of which is 
related to mapping: 

1. if n is an instance of an atomic type it cannot have any children;

2. if n is an instance of a structured collection (List, Bag, or Set) all of the children of n must
have the same type;

3. if n is an instance of a Record type defined as the set of fields
$F = \{\langle t_1, f_1 \rangle, \langle t_2, f_2 \rangle, \ldots, \langle t_m, f_m \rangle\}$
and n has the children $c_1, c_2, \ldots, c_k$:

(a) $(\forall i)[\langle c_i.\mathit{typeName}, c_i.\mathit{label} \rangle \in F]$,

(b) $(\forall i, j)[(1 \le i \le k) \land (1 \le j \le k) \land (c_i.\mathit{label} = c_j.\mathit{label}) \rightarrow (i = j)]$;

4. if n is an instance of a SemiRecord type defined as the set of fields
$F = \{f_1, f_2, \ldots, f_m\}$ and n has the children $c_1, c_2, \ldots, c_k$:

(a) $(\forall i)[c_i.\mathit{label} \in F]$,

(b) $(\forall i, j)[(1 \le i \le k) \land (1 \le j \le k) \land (c_i.\mathit{label} = c_j.\mathit{label}) \rightarrow (i = j)]$;

5. let p be the parent of n, l its left sibling, and r its right sibling. The following must hold:

(a) $0 \le n.\mathit{startOffset} \le n.\mathit{endOffset} \le \mathit{parentLength}$, where parentLength is taken to be
$p.\mathit{endOffset} - p.\mathit{startOffset}$ if p exists and the length of the document otherwise;

(b) if l exists, $l.\mathit{endOffset} \le n.\mathit{startOffset}$;

(c) if r exists, $n.\mathit{endOffset} \le r.\mathit{startOffset}$.

It is the responsibility of the Instance Manager and Document Manager, described below, to ensure 
that every tree instance is legal. 
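
A minimal sketch of such a legality check is given below, assuming a simple in-memory node class. The names Node, ATOMIC, and is_legal are hypothetical, the type table format is invented for the example, condition 4 (SemiRecord) is left out for brevity, and the comparisons in condition 5 are treated as non-strict, which is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

ATOMIC = {"Integer", "Float", "String", "Date", "EmailAddress", "URL"}


@dataclass
class Node:
    type_name: str
    start: int                     # relative to the parent's region
    end: int
    label: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def is_legal(node: Node, types: Dict[str, tuple], parent_length: int) -> bool:
    """Check conditions 1, 2, 3 and 5 for node and all of its descendants.

    types maps a composite type name to ("List", element_type) or
    ("Record", {field_name: field_type}); this layout is an assumption.
    """
    kind = "Atomic" if node.type_name in ATOMIC else types[node.type_name][0]
    kids = node.children
    if kind == "Atomic" and kids:
        return False                                   # condition 1
    if kind in ("List", "Bag", "Set"):
        if len({c.type_name for c in kids}) > 1:
            return False                               # condition 2
    if kind == "Record":
        fields = types[node.type_name][1]
        labels = [c.label for c in kids]
        if len(set(labels)) != len(labels):
            return False                               # condition 3(b)
        if any(fields.get(c.label) != c.type_name for c in kids):
            return False                               # condition 3(a)
    if not (0 <= node.start <= node.end <= parent_length):
        return False                                   # condition 5(a)
    for left, right in zip(kids, kids[1:]):
        if left.end > right.start:
            return False                               # conditions 5(b), 5(c)
    return all(is_legal(c, types, node.end - node.start) for c in kids)
```

In a model-view-controller arrangement, a check of this kind would run inside the model before any insertion is committed, so that no view or controller can leave the tree in an illegal state.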

3.2 Components 

NoDoSE is intended to be a test bed for studying the data extraction problem. Thus, rather 
than build a monolithic tool and force other researchers to wade through thousands of lines of 
source code, we've designed the system as a set of components that communicate through Java 
interfaces. Any of the components can be replaced independent of the others, and for certain types 
of components, more than one can be instantiated at any given time. At present, changing
components still involves changing a few lines of code in the top level NoDoSE class, but in
the next version we plan to use the Java Reflection API to allow components to be changed or
added dynamically without any code changes.

Figure 7 shows most of the components in NoDoSE and how they interact. The reporting 
component has not been shown in the interest of readability but will be discussed below. Most of 
the components provide the basic infrastructure for the system: reading files, maintaining document 
trees and type information, and supporting undo/redo. We expect that these components, which 
we collectively call the support components, will rarely be the subject of experimentation. The 
remaining three components are those most likely to be changed: the structure miners, the report 
generators, and the GUI. 

Allowing third parties to build components that modify the data is dangerous since they may 
accidentally violate constraints. For example, a GUI may allow the user to add a child to an 
atomic type. To avoid these problems we have adopted the model-view-controller [Gol90, KGP88]
paradigm. The model, which stores the data and enforces constraints, is maintained by the support
components. The other three component types, Reporters, Miners, and GUIs, all provide views of
the data in the model. In addition, the miner and GUI both serve as controllers since they modify
the data. This must be done through the model, however, which guarantees that no constraints
can be violated. If a controller performs an action that would violate a constraint, an exception
is raised by the model which can be trapped and handled by the controller.

Figure 7: Internal architecture.

Because there can be many views on the model, the core components also support the
observer/observable paradigm [GEH+94]. For example, the type manager allows other components
to register as observers of a particular type. Whenever a new instance is added to that type or an 
instance deleted, the observers are notified. The mining component described in Section 4 uses this 
notification to incrementally maintain its statistics about a given type. 
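
The observer registration described above might look roughly like the following. This is a sketch under stated assumptions: NoDoSE itself is written in Java, and the TypeManager and CountingMiner classes, their method names, and the event strings are all invented here for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

# An observer receives (type_name, event, instance).
Observer = Callable[[str, str, object], None]


class TypeManager:
    """Keeps the instances of each type and notifies registered observers."""

    def __init__(self) -> None:
        self._observers: DefaultDict[str, List[Observer]] = defaultdict(list)
        self._instances: DefaultDict[str, List[object]] = defaultdict(list)

    def register(self, type_name: str, observer: Observer) -> None:
        self._observers[type_name].append(observer)

    def add_instance(self, type_name: str, instance: object) -> None:
        self._instances[type_name].append(instance)
        self._notify(type_name, "added", instance)

    def delete_instance(self, type_name: str, instance: object) -> None:
        self._instances[type_name].remove(instance)
        self._notify(type_name, "deleted", instance)

    def _notify(self, type_name: str, event: str, instance: object) -> None:
        for observer in self._observers[type_name]:
            observer(type_name, event, instance)


class CountingMiner:
    """Toy observer that maintains an instance count per type incrementally."""

    def __init__(self) -> None:
        self.counts: DefaultDict[str, int] = defaultdict(int)

    def on_change(self, type_name: str, event: str, instance: object) -> None:
        self.counts[type_name] += 1 if event == "added" else -1
```

A real miner would update richer statistics than a count, but the point is the same: it never rescans the whole tree, it only reacts to notifications.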

Below we describe all of the components of the system. Details of the version 1.0 implementation 
can be found in Section 5. 

File Manager - Enables sections of a file to be read or modified by the other components.
Modifications are not performed directly on the original file; a separate file of changes is maintained.
This ensures that NoDoSE cannot corrupt an input file and that it can work with read-only 
files or files residing on remote machines. There is a one to one mapping between file managers 
and files. 

Instance Manager - Maintains the document tree for a file, providing all of the basic tree
manipulation operations, such as node insertion and deletion. It also provides methods that map
the tree to the file or vice versa. For example, when a user double clicks in the document
text panel, the GUI can use the instance manager to find the tightest bounding node for the
point in the file so that it can display its type information. A particular instance manager
stores only a single tree so every file must have its own.

Document Manager - Maintains information about a class of document (i.e., the output of a
particular simulator). In particular, it stores a list of all of the files of a particular document 
class, as well as information on all of the types used in the files. The document manager 
stores four pieces of information for each type in the system: its name, its definition, a list of 
all of the nodes of the type, and a list of observers, which are the components that want to 
be notified when any of the type information changes. 

Log Manager/Transaction Manager - Supports user undo/redo.

Structure Miner - Attempts to automatically determine how to parse a given node type. This 
component is described in much more detail in Section 4. 

GUI - Unlike in many systems, the GUI in NoDoSE is truly a replaceable component. In fact, 
we are currently working on an alternate user interface for structured documents through 
which the user first specifies the schema of the document in ODL and then maps the text of 
documents to the schema. In a sense, this is the opposite of the current approach. 

Reporter - Outputs extracted information, usually as a report, a load file, or as information 
needed by a wrapper generator. Multiple Reporter components can be active within one 
system to give the user different output options. 
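
As one concrete example of what a component interface has to offer, the tightest-bounding-node lookup mentioned for the Instance Manager can be sketched as a short recursive search. The Node class and the tightest_node function are hypothetical, and the sketch assumes that end offsets are exclusive and that offsets are relative to the parent's region.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    type_name: str
    start: int                     # relative to the parent's region
    end: int                       # exclusive, relative to the parent's region
    children: List["Node"] = field(default_factory=list)


def tightest_node(node: Node, position: int, base: int = 0) -> Optional[Node]:
    """Return the deepest node whose region contains the file position."""
    start, end = base + node.start, base + node.end
    if not (start <= position < end):
        return None
    for child in node.children:
        hit = tightest_node(child, position, start)
        if hit is not None:
            return hit
    return node                    # no child contains the position
```

A GUI could call this with the character position of a double click and then display the type name of the node that comes back.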

4 Mining for Structure 

This section describes the two mining/parsing components that have been implemented so far: one 
that mines text files and one that parses HTML code. Both components are limited in scope: the
text miner only handles structured types and the HTML parser does not handle frames or other 
advanced features. Despite their limitations, however, we have been able to extract data from an 
interesting set of documents. Further, building two different mining components has forced us to 
ensure that the interfaces exposed to the mining components are powerful enough and clean enough 
to support different algorithms. Details of both mining components appear below. 

4.1 Plain text miner

The component described in this section attempts to determine the parsing rule for instances of 
a type. The particular type being mined in any invocation is called the target type. For this first 
version of NoDoSE, we chose to concentrate on mining structured types (Set, Bag, List and Record) 
since the types of the child nodes are known by definition. After we develop robust algorithms
for this mining problem we plan to study the mining of semistructured types. 

Another simplification in the current version is that we use the same algorithm for all of the 
collection types (Set, Bag, and List) since the semantic differences between them are rarely
noticeable at the level of the format of the text file. Thus, for the remainder of the section, when we
discuss mining Lists our comments will be equally valid for mining any collection type. 

Meaning                         Notation
Begin with marker               [With Marker "marker"]
Begin after marker              [After Marker "marker"]
Begin at fixed offset           [Offset offset]
End with marker                 [With Marker "marker"]
End before marker               [Before Marker "marker"]
End after a fixed # of lines    [After Lines numLines]
End at fixed offset             [Offset offset]

Table 1: Parse rule components for the plain text miner.

The algorithms for mining lists and records are both based on the same overall three step
strategy:




1. Theory generation - Create a set of theories for how to parse the instances of the target 
type. For a list type, a simple theory is "each element will be separated by a comma". 

2. Theory evaluation - For each theory under consideration, 

(a) Parse every node that is an instance of the target type (a list of which can be retrieved
from the Document Manager) to generate a list of predicted child nodes. Note that
if multiple documents are loaded, not all of the nodes will necessarily be from the same
document.

(b) Compare the predicted nodes to the nodes that are actually present in the document 
tree. Count the number of nodes in the document trees that were not predicted, which we 
call false negatives, and the number of predicted nodes that cannot possibly be correct, 
which we call false positives. 

3. Theory application - If one or more of the theories has no false positives or negatives, 
pick one of them and add its predicted nodes to the document tree (or trees, if more than 
one document is loaded). 

Although we will need different types of theories for parsing lists and records, the two share common
elements. In each case, we are trying to subdivide the portion of the file corresponding to a given
node, which we call the node text, into smaller units, each either a list element or a record field. To
find the boundaries of the units we will need two theories: a theory about how the beginning of a
unit is determined, called a start theory, and a theory about how the end of a unit is determined,
called an end theory. We call the combination of a start theory and an end theory a unit theory.

Table 1 lists all of the start and end theories used in the current implementation of NoDoSE's 
text mining component. Most include a variable that must be instantiated, often a marker which 
is a string that separates units. For example, in the file from Figure 6, the best start theory would
be [After Marker ","]. We could use something similar for the best end theory: [Before Marker
","]. To represent the resultant unit theory, which is the combination of the start and end theory,
we will write <[After Marker ","],[Before Marker ","]>. Rather than explain the meaning of
the rest of the theories here, we introduce the concrete problem of parsing lists to give the discussion 
more context. 
4.1.1 Mining lists

Mining a list type entails two tasks: finding a parsing rule that identifies every element of a list of the
target type and then parsing every instance of the target type using the rule. Figure 9 shows the
three instances of a target type, Roster = List<String>, that will serve as a running example for
this section. The lists are meant to represent the rosters of basketball teams (see footnote 2). The boxes around
parts of the lists indicate the elements that the user has already identified.

The mining algorithm for list types depends on three assumptions:

1. Every element of the list will have the same type. Since this algorithm will only run on 
structured collection types, this assumption must hold. 

2. Every element of the list will have the same format. This assumption does not necessarily 
hold but seems reasonable given assumption 1 and simplifies the grammar induction. 

3. If k elements of a list have been identified by the user, they will be the first k elements in the
list. This is a very powerful assumption because it gives the miner a way to identify theories 
that generate false positives — no predicted unit can appear before any preexisting unit in a 
list. It does, however, impose restrictions on how structure information must be input by the 
user. 

Of course all three assumptions hold for our example. The first holds by definition: all of the 
elements are of type String. The second holds as well since each element has the basic format 
"Player Name: name". The third also holds since list 1 has all of its elements specified, list 2 has 
none of its elements specified, and list 3 has only one element specified but it is the first element.
An example of a violation of assumption 3 would be if list 2 had the player named Hill specified 
without having Dumars specified as well. 

Lists can be viewed in general terms as a header followed by the elements of the list separated 
by gaps. Of course, specific list types may not have headers or gaps at all. Figure 10 shows how 
list 1 of our example fits this pattern. For any list with at least one element defined, we know 
the boundaries of its header — everything to the left of the first element by assumption 3 above. 
Also, everything between two defined elements is necessarily a gap since no other element could 
exist between the two according to assumption 3. Thus even if some of the list instances have no 
elements defined and others have only some of their elements defined, the miner will usually still 
be able to identify a few headers and gaps (if the list text has them). In our example, the headers 
of lists 1 and 3 are known as is the gap between elements 1 and 2 in list 1.



2 If the user were interested in capturing the name of the team as well, he would probably not define the example
strings to be Lists at all. Instead, they could be defined (in two steps) as Record{String teamName, List<String>
playerNames}.

List 1: Team Name: Hawks; Name: [Anderson] Name: [Blaylock]
List 2: Team Name: Pistons; Name: Dumars Name: Hill Name: Hunter
List 3: Team Name: Bucks; Name: [Allen] Name: Brandon

Figure 9: Example lists. Square brackets stand in for the boxes that mark user-identified elements.



Header                         Element    Gap        Element   Gap        Element
"Team Name: Pistons; Name: "   "Dumars"   " Name: "  "Hill"    " Name: "  "Hunter"

Figure 10: Presumed format of a list.



The task of the miner is to generalize from these examples to discover how to identify the 
headers and gaps in all of the lists. This will require two types of theory: an end theory for finding
the end of the header (if present), which we will call the header theory, and a unit theory for finding 
the beginning and end of the elements of the list, which we will call the element theory. 

Simplified pseudo-code for the mining algorithm used in NoDoSE is shown in Figure 11. The code
employs the three steps described in the preceding section: theory generation (lines 1-2), theory 
evaluation (lines 3-28), and theory application (lines 29-32). Each of the steps is discussed in detail 
below. 

Theory generation For lists, we have two types of theories to generate, header and element
theories. Let us first consider how to generate the set of header theories, T_H. We begin with
the end theories from Table 1. Next, we instantiate the placeholders in the theories, which means
choosing markers, offsets, and numbers of lines. For example, consider instantiating the marker in
the theory [With Marker "marker"]. To do so, we need to examine all of the known headers and
find their longest common suffix. In the running example, the longest common suffix is the string
" Name: " and so the instantiated theory, [With Marker " Name: "], is added to T_H. (In fact,
the prototype actually adds two almost identical versions of this theory, one in which the marker is
case sensitive and one in which it is not, but this does not affect the algorithms in any fundamental
way and hence will not be further mentioned.)

Often there will not be any consistent value with which to instantiate a theory. For instance, 
there may be no common suffix at all or the headers may not all have the same number of lines in 
them. In this case, the theory is not added to T_H.
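
Marker instantiation can be sketched as follows. This is a plain character-level longest common suffix, which is an assumption of the sketch: the prototype may normalize or tokenize the headers differently before comparing them. The function names and the header strings used in the usage note are hypothetical.

```python
from typing import List, Optional


def longest_common_suffix(strings: List[str]) -> str:
    """Return the longest string that every input ends with."""
    if not strings:
        return ""
    shortest = min(len(s) for s in strings)
    length = 0
    # Walk backwards from the end while all strings agree on the character.
    while length < shortest and len({s[-1 - length] for s in strings}) == 1:
        length += 1
    first = strings[0]
    return first[len(first) - length:]


def instantiate_with_marker(headers: List[str]) -> Optional[str]:
    """Instantiate [With Marker "marker"], or return None if no marker fits."""
    marker = longest_common_suffix(headers)
    return f'[With Marker "{marker}"]' if marker else None
```

For two made-up headers such as "Report 1 -- Items: " and "Summary 22 -- Items: " the instantiated theory is [With Marker " -- Items: "]; for headers with no common suffix the function returns None and no theory is added.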

Generating element theories is similar to generating header theories except that element theories 
are composed of both a start theory and an end theory — header theories do not require start 
theories since they always begin at offset 0. With the possible exception of the first and last 
elements in a list, an element can be viewed as shown in Figure 10. A pre-gap is the gap between 
the element and the preceding element and a post-gap the gap between the element and the following 
element. The theories for parsing list elements are based on trying to find a common suffix in the 
pre-gaps, a common prefix in the elements, a common suffix in the elements, or a common prefix in 
the post-gaps. In the example lists, the common prefix and suffix of the gaps are both " Name: "
and the elements do not have a common start or end marker. Thus the set of valid start theories,
T_start, is {[After Marker " Name: "]} and the set of valid end theories, T_end, is {[Before Marker
" Name: "]}. The set of candidate element theories, T_E, is computed as T_start × T_end, which in this
case is just {<[After Marker " Name: "],[Before Marker " Name: "]>}.

Theory evaluation This step represents the majority of the code in Figure 11. Conceptually, 
the code chooses the best header theory and then uses that theory in trying to find the best unit 
theory. This is a bit of a short cut since the algorithm should really consider all possible pairs of 
header and element theories. Unfortunately, this is very expensive with even a moderate number 
of different theories. Hence we choose to separate the two tasks, realizing that in some cases we 
may miss the best pairing. 

The best header theory is determined in Lines 3 through 10. The algorithm tries each theory, 
using it to predict the headers of all of the list instances. For those lists that have at least one 
element defined, the header is known and can thus be checked against the predicted header. If
one of the theories correctly identifies all of the headers it is assumed to be correct and saved for use
in element parsing. If no theory is correct, we assume that headers do not have to be specially 
handled for the target list type. 

In the example from Figure 9, it is critical that the headers are handled. By skipping past the
header, the element theory <[After Marker " Name: "],[Before Marker " Name: "]> can be
used to correctly identify every element. If the header is not skipped, the same element theory will
predict that the first element of the list starts with the " Name: " marker that is part of "Team
Name: " and thus mining will fail. If the headers were "Team: " instead of "Team Name: ",
however, the element theory would work even if headers were not skipped. Thus many lists do
not require a header to be found at all and therefore the mining algorithm continues even if no
consistent header theory can be found.
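
The effect of skipping the header can be reproduced with a small sketch. It makes two assumptions that are not taken from the prototype: the search position passed in as search_from stops just before the gap that precedes the first element, and an element whose end marker is not found runs to the end of the text. The predict_elements function is a hypothetical name.

```python
from typing import List, Tuple

Span = Tuple[int, int]


def predict_elements(text: str, start_marker: str, end_marker: str,
                     search_from: int = 0) -> List[Span]:
    """Apply <[After Marker start_marker],[Before Marker end_marker]> repeatedly."""
    if not start_marker or not end_marker:
        raise ValueError("markers must be non-empty")
    spans: List[Span] = []
    position = search_from
    while True:
        marker_at = text.find(start_marker, position)
        if marker_at == -1:
            break                          # start theory fails: no more elements
        start = marker_at + len(start_marker)
        end = text.find(end_marker, start)
        if end == -1:
            end = len(text)                # assumption: last element runs to the end
        spans.append((start, end))
        position = end
    return spans


roster = "Team Name: Pistons; Name: Dumars Name: Hill Name: Hunter"
marker = " Name: "
without_skip = [roster[s:e] for s, e in predict_elements(roster, marker, marker)]
with_skip = [roster[s:e] for s, e in
             predict_elements(roster, marker, marker, roster.index(";") + 1)]
```

Here without_skip begins with the spurious element "Pistons;" because the marker also occurs inside "Team Name: ", while with_skip contains only the three player names.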

The code from Lines 11 to 28 uses each element theory to predict the elements in all of the 
lists. For each list, the search for elements starts at the first character in the unit text unless a 
header is present in which case the search begins immediately after it (line 17). The start theory 
is used to find the predicted beginning of the next element (line 19), and if it is successful, the
end theory is used to find the predicted ending of the element (line 21). If a new element is
found it is added to the predicted set. The search continues in the same unit text until no more
elements are found. At this point the predicted elements are compared against the elements defined
by the user. The function findFalseNegatives(predictedSet, actualSet) counts the number of user
defined elements that were not predicted by the theory, |actualSet - predictedSet|. The function
findFalsePositives(predictedSet, actualSet) counts the number of predicted elements that must
be incorrect, which are the new predicted elements that start before the last user defined element 
ends (by assumption 3). For example, in Figure 9 if a theory predicted that an element other than 
"Allen" started prior to the space after "Allen" in list 3, it would have to be incorrect. For list 2 
which has no user defined elements, however, no predicted element can be eliminated. 
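The evaluation step can be illustrated with a minimal Python sketch. It is not the NoDoSE implementation: the sample text, the markers "Name: " and ";", the header rule (header ends after the first ";"), and all function names are assumptions made for the illustration. It shows how a marker-based element theory mis-parses the header unless the header is skipped, and how the two counting functions detect this.

```python
def find_elements(text, start_marker, end_marker, offset=0):
    """Apply an <After start_marker, Before end_marker> theory repeatedly.

    Returns a list of (start, end) character spans, one per predicted element.
    """
    spans = []
    while offset < len(text):
        i = text.find(start_marker, offset)
        if i == -1:
            break
        start = i + len(start_marker)
        end = text.find(end_marker, start)
        if end == -1:
            break
        spans.append((start, end))
        offset = end
    return spans


def false_negatives(predicted, actual):
    """User-defined elements that the theory failed to predict."""
    return len(set(actual) - set(predicted))


def false_positives(predicted, actual):
    """Predicted elements that must be wrong: they are not user-defined,
    yet they start before the last user-defined element ends."""
    if not actual:
        return 0  # nothing to compare against, so nothing can be ruled out
    known = set(actual)
    last_end = max(end for _, end in actual)
    return sum(1 for s in predicted if s not in known and s[0] < last_end)


# Hypothetical list instance; the user has marked only "Allen".
text = "Team Name: Reds; Name: Allen; Name: Baker; Name: Cole;"
allen = text.index("Allen")
actual = [(allen, allen + len("Allen"))]

no_skip = find_elements(text, "Name: ", ";")
header_end = text.find(";") + 1  # assumed header theory
with_skip = find_elements(text, "Name: ", ";", header_end)
names = [text[s:e] for s, e in with_skip]
```

Without the header skip, the theory predicts "Reds" as an element that starts before the user-defined "Allen", so it is counted as a false positive; with the skip, the theory is consistent and contributes the new elements "Baker" and "Cole".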

(1)  Create and instantiate the set of header theories, T_H.
(2)  Create and instantiate the set of element theories, T_E.
(3)  t_h,best = null
(4)  for each header theory, t_h, in T_H
(5)      for each list l in L
(6)          offset = findHeaderEnd(t_h, l)
(7)          if (l has a header and offset ≠ l.headerEnd) then
(8)              t_h.errors++
(9)      if (t_h.errors == 0) then
(10)         t_h,best = t_h
(11) t_e,best = null
(12) for each element theory t_e in T_E
(13)     for each list l in L
(14)         offset = 0
(15)         predicted = {}
(16)         if (t_h,best ≠ null) then
(17)             offset = findHeaderEnd(t_h,best, l)
(18)         while (offset < l.length)
(19)             start = offset = findElementStart(t_e, l, offset)
(20)             if (start ≠ -1)
(21)                 end = offset = findElementEnd(t_e, l, offset)
(22)                 if (end ≠ -1)
(23)                     predicted = predicted ∪ {<l, start, end>}
(24)         t_e.predicted = t_e.predicted ∪ predicted
(25)         t_e.falsePos += findFalsePositives(predicted, l.elements)
(26)         t_e.falseNeg += findFalseNegatives(predicted, l.elements)
(27)     if (t_e.falsePos == 0 ∧ t_e.falseNeg == 0) then
(28)         t_e,best = t_e
(29) if (t_e,best ≠ null) then
(30)     for each element <l, start, end> in t_e,best.predicted
(31)         if <start, end> ∉ l.elements
(32)             l.elements = l.elements ∪ {<start, end>}

Figure 11: The (simplified) algorithm for mining and parsing list types.

Theory application As long as at least one consistent element theory has been found, the
elements that it predicted and that were not in the original document trees are added using the
Instance Manager (lines 29-32). The new elements are of the same type as all of the other children
of the target type, the author ids identify the text miner as the author, and a confidence factor
can be assigned. In this version the confidence factor of the new elements is set to 0.5, but in the
future we plan to compare each element against statistics gathered on all of the elements to try to
identify questionable predictions. For instance, if all of the elements are 40 characters long except
for one which is 80, it is likely that due to an error in the parsing rule or a typo in the document,
two consecutive elements have mistakenly been parsed as one.
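The length check described as future work could look roughly like the following Python sketch. The function name, the use of the median, and the threshold factor of 2 are assumptions chosen for illustration; the paper does not specify which statistics would be gathered.

```python
from statistics import median


def questionable(elements, factor=2.0):
    """Return the elements whose length is at least `factor` times the
    median element length (an assumed heuristic, not NoDoSE's)."""
    typical = median(len(e) for e in elements)
    return [e for e in elements if len(e) >= factor * typical]


# Three 40-character elements and one 80-character element, as in the text.
elements = ["a" * 40, "b" * 40, "c" * 40, "d" * 80]
suspects = questionable(elements)
```

Here only the 80-character element is flagged, which matches the case where two consecutive elements were merged by mistake.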

4.1.2 Mining records 

The mining algorithm for record fields depends on four assumptions: 

1. Every field in a record has a unique name. This assumption is enforced by the Instance 
Manager. 

2. If the fields of two different records of the same type have the same name, the two fields 
themselves will have the same type. Such fields are called corresponding fields. For example, 
if two records of the same type both have fields named phoneNumber, the two phoneNumber 
fields should have the same type. This assumption is also enforced by the Instance Manager. 

3. All corresponding fields will have the same format. This assumption is not forced on the 
mining component but it seems reasonable given assumption 2 and simplifies the grammar 
induction. 

4. The fields in a record instance are either completely identified by the user or not identified 
at all. Thus if k fields of a record instance have been identified by the user, they will be the 
only k fields in that instance. This is a very powerful assumption because it gives the miner a 
way to detect a parsing theory that generates false positives — a predicted field must appear 
in a record if the user has identified any fields at all in that record. 

Note the assumptions that this component does not make: every field is present in every record 
instance, and the order of fields within a record is fixed. Thus we are able to parse a limited but 
useful class of semistructured documents. 

Mining lists and records are similar except for one important difference. For lists, we assume 
that the format of every element is the same. This assumption allows the list mining algorithm 
to consider only two sets of theories, one to skip past the header and one to identify elements. In 
contrast, every field in a record type may have a different format and thus every field requires its 
own set of theories. Further, the order in which the algorithm tries to parse the fields is important. 

For example, consider the text of a record that contains, among other things, the string "Name: 
Smith, John.". Suppose the user chooses to model a name as two fields, LastName and FirstName. 
The best theory for identifying a LastName field might be <[After Marker "Name:"], [Before 
Marker ","]>. If we know that a FirstName field always follows LastName, its rule would be 
<[After Marker ","], [Before Marker "."]>. We must be careful, though, to only try to apply 
the unit theory for a first name immediately after parsing a last name. Otherwise, an unrelated 
comma anywhere in the text of the record would lead to the first name being falsely parsed. 
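The effect of field order can be seen in a small Python sketch. The record text and the helper parse_field are hypothetical; only the two theories come from the example above.

```python
def parse_field(text, after, before, offset=0):
    """Apply an <After Marker, Before Marker> theory starting at `offset`.

    Returns the (start, end) span of the field, or None if either marker
    is missing.
    """
    i = text.find(after, offset)
    if i == -1:
        return None
    start = i + len(after)
    end = text.find(before, start)
    if end == -1:
        return None
    return start, end


# Hypothetical record containing an unrelated comma before the name.
record = "Dept: Sales, West. Name: Smith, John. Phone: 555-0100."

# Applying the FirstName theory from the start of the record is wrong.
s, e = parse_field(record, ",", ".")
naive_first = record[s:e].strip()

# Applying it immediately after LastName has been parsed is right.
ls, le = parse_field(record, "Name:", ",")
last = record[ls:le].strip()
fs, fe = parse_field(record, ",", ".", offset=le)
first = record[fs:fe].strip()
```

The naive application picks up "West" because of the comma after "Sales", while the ordered application yields "Smith" and then "John".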

To avoid problems of this sort, the mining algorithm tries to find an order for the fields in a 
record type that is consistent across all of its instances. This is difficult for two reasons: 

1. Not all fields are present in each record instance and no single instance is guaranteed to have 
every field in it. Thus we may have to look at more than one record instance to determine 
the field order and may not be able to determine a unique ordering even if we look at every 
record instance. 

2. An important ordering may only exist between subsets of the fields in the record and the 
other fields may exhibit an inconsistent ordering. 

(1)  Pick a consistent order for the fields of the target type.
(2)  numFieldsMined = 0
(3)  for i = 1 TO N_f
(4)      Create and instantiate the set of field theories, T_Fi.
(5)      t_best,i = null
(6)  for j = 1 TO N_r
(7)      lastOffset_j = 0
(8)  for i = 1 TO N_f
         -- Try to find field i in every record --
(9)      for j = 1 TO N_r
(10)         for each unit theory t in T_Fi
(11)             start = findFieldStart(t, r_j, lastOffset_j)
(12)             if (start ≠ -1) then
(13)                 end = findFieldEnd(t, r_j, start)
(14)                 if (end ≠ -1)
(15)                     t.predicted = t.predicted ∪ {<r_j, start, end>}
(16)                     t.falsePos += findFalsePositives({<r_j, start, end>}, r_j.fields)
(17)             if (start = -1 ∨ end = -1) then
(18)                 t.falseNeg += findFalseNegatives(i, r_j.fields)
         -- Find the first theory that perfectly predicted field i using the theories --
(19)     for each theory t in T_Fi
(20)         if (t.falsePos == 0 ∧ t.falseNeg == 0) then
(21)             t_best,i = t
(22)             numFieldsMined++
(23)             for j = 1 TO N_r
(24)                 newOffset_j = findFieldEnd(t, r_j, findFieldStart(t, r_j, lastOffset_j))
(25)                 if (newOffset_j ≠ -1) then
(26)                     lastOffset_j = newOffset_j
(27)             break
(28) if (numFieldsMined = N_f) then
(29)     for i = 1 TO N_f
(30)         for each element <r, start, end> in t_best,i.predicted
(31)             if (<start, end> ∉ r.elements)
(32)                 r.elements = r.elements ∪ {<start, end>}

Figure 12: The simplified algorithm for mining and parsing record types. 

The miner uses a simple algorithm that computes for each field, the set of all fields that have 
preceded it in at least one record and the set of all fields that have followed it in at least one 
record. The two sets can be used to find a totally consistent ordering if one exists although it is 
not guaranteed to handle the second complication from above. So far, this has not been a problem 
in the documents we have mined. 
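One way to realize such an ordering step is sketched below in Python. This is an assumption-laden illustration, not the miner's actual algorithm: it records only the "preceded by" sets and then hands them to a standard topological sort (graphlib, Python 3.9+), returning None when the observed orders contradict each other. The field names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def consistent_order(records):
    """Derive a field order consistent with every record instance.

    `records` is a list of field-name sequences, each in the order the
    fields appear in one record instance. Returns a list of field names,
    or None if two instances order some fields in conflicting ways.
    """
    preceded_by = {}
    for fields in records:
        for i, name in enumerate(fields):
            # Every field seen earlier in this instance precedes `name`.
            preceded_by.setdefault(name, set()).update(fields[:i])
    try:
        return list(TopologicalSorter(preceded_by).static_order())
    except CycleError:
        return None


# No single instance contains all three fields, yet the order is determined.
order = consistent_order([
    ["last", "first"],
    ["first", "phone"],
    ["last", "phone"],
])
conflict = consistent_order([["a", "b"], ["b", "a"]])
```

As the text notes for the miner's own algorithm, a sketch like this finds a totally consistent ordering when one exists but does not address the case where only a subset of the fields is consistently ordered.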

The pseudo-code for the record mining algorithm is shown in Figure 12. Lines 1 through 7 
do theory creation and general initialization. Theory variables are instantiated by comparing the 
corresponding fields in all of the records, looking for a common pre-gap marker, start marker, end 
marker, or post-gap marker as was done with list elements. 

The for loop starting at line 8 performs the meat of the algorithm: all of the record instances 
of the target type are parsed in parallel, one field at a time. Each theory for the current field 
number is tried to see if it can identify the field in each of the record instances (lines 11-15). If 
so, the algorithm calls findFalsePositives to see if the field should really have been found. Using 
assumption 4, a predicted field is a false positive if it has not already been defined and the user 
has defined at least one field in the record instance. If no field was found for a record instance, the 
algorithm checks that none was defined by the user (lines 17-18). 
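The two checks that follow from assumption 4 can be written down compactly. The Python sketch below is illustrative only; the function names and the representation of user-defined fields as a dictionary from field name to (start, end) span are assumptions.

```python
def is_false_positive(name, span, user_fields):
    """A predicted field is a false positive if the user defined at least
    one field in this record but did not define this one at this span."""
    return (
        span is not None
        and bool(user_fields)
        and user_fields.get(name) != span
    )


def is_false_negative(name, span, user_fields):
    """No field was found although the user defined one with this name."""
    return span is None and name in user_fields


# Hypothetical record instance in which the user marked only LastName.
user_fields = {"LastName": (6, 11)}
fp_first = is_false_positive("FirstName", (13, 17), user_fields)
fp_last = is_false_positive("LastName", (6, 11), user_fields)
fp_unmarked = is_false_positive("FirstName", (13, 17), {})
fn_last = is_false_negative("LastName", None, user_fields)
```

A record in which the user marked nothing can never produce a false positive, which is why such records cannot be used to reject a theory.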

After all of the theories have been tried on all of the record instances, the first consistent theory 
is chosen (lines 19-26). In addition, the current offsets into all of the records are updated to account 
for the newly parsed fields. The main loop then repeats, starting from the new offsets and looking 
for the next field. 

After all of the fields have been parsed, the algorithm checks that it has found a consistent 
theory for every field. If it has, all of the fields predicted by the consistent theories are added to 
their records. 

4.2 HTML Parser 

The HTML parser available as part of NoDoSE parses documents or subdocuments based completely 
on structural information using recursive descent. Unlike the plain text miner, the HTML 
parser does not store any internal information about a type since it parses based on the static 
grammar rules of HTML. The tags that are understood by the parser are listed in Table 2 with a 
brief description of how they are represented in the document tree. Other 
tags are just considered part of the text of the document. The following discussion is necessarily 
brief due to space constraints. 

Tags                     Representation
-----------------------  -----------------------------------------------------------------
head, title, meta        Head is represented as a Record with two fields: title, which is
                         a String extracted from the title tag, and meta, which is a List
                         of strings extracted from the meta tags.
body                     Body is represented as a SemiList. Typically its children will be
                         first level headings.
h1, h2, h3, h4, h5, h6   Each heading is represented as a SemiList. Typically its children
                         are paragraphs, represented as Strings, and sub-headings.
ul, ol, dir, menu, li    All of the list formats are represented by a SemiList of list
                         items. Lists can be arbitrarily nested.
table, tr, td            A table is represented as a SemiList of rows. Each row is a
                         SemiList of the data in the cells.
anything else            Represented in the tree as a string.

Table 2: Translation of HTML tags into document structure. 

In practice, the HTML parser generates more structure than the user is interested in (i.e., 
information from meta tags). After parsing, the user should delete the nodes that are not of interest 
and rename types and labels to be semantically meaningful. The changes are not recorded by the 
HTML parser, though, so the next instance of the same page type will require all the changes to 
be made again. To avoid this problem, the plain text miner can be run on the parsed document 
produced by the HTML parser. It will try to infer the format of the file just as if the structure had 
been entered using the GUI. If it is successful, future HTML files of the same type can be parsed 
automatically. 
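The first row of Table 2 can be illustrated with a short Python sketch built on the standard html.parser module. It is not the NoDoSE parser (which is written in Java and uses recursive descent); the class name, the dictionary standing in for a Record, and the choice to take each meta tag's content attribute as its string are all assumptions.

```python
from html.parser import HTMLParser


class HeadParser(HTMLParser):
    """Collect the head of a page as a record with `title` and `meta`."""

    def __init__(self):
        super().__init__()
        self.record = {"title": "", "meta": []}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Assumption: the string kept for a meta tag is its content.
            content = dict(attrs).get("content")
            if content is not None:
                self.record["meta"].append(content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.record["title"] += data


# Hypothetical page.
page = (
    "<html><head><title>Teams</title>"
    '<meta name="author" content="A. Smith">'
    '<meta name="keywords" content="lists">'
    "</head><body><h1>Roster</h1></body></html>"
)
parser = HeadParser()
parser.feed(page)
head = parser.record
```

A user who has no interest in the meta strings would delete that field after parsing, which is exactly the kind of edit the plain text miner is then asked to generalize.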

The current version of the plain text parser only works with structured types, however, so 
additional modification of the document tree produced by the HTML parser is necessary. Thus the 
typical use of the HTML parser follows these steps: 

1. Use the GUI and plain text miner to decompose the document until the HTML portion is 
reached. If the entire document is HTML, this step can be skipped. 

2. Run the HTML parser on the HTML portion. This creates a sub-tree below the current node 
that represents the HTML portion of the document parsed. 

3. Use the editing capabilities of the GUI to convert the sub-tree to a form suitable for mining. 
For example, the user must delete the nodes that are not of interest and rename types and 
labels to be semantically meaningful. Also, the nodes created by the HTML parser will all 
be semistructured (except for the atomic types). The types of these nodes must be changed 
to a structured type before the miner described in the previous section can work on them. 

4. If there are other documents of the same type, load them and use the plain text miner to 
mine them. 

5. Perform any needed edit steps and repeat step 4 until the plain text miner gets the correct 
results for all of the files. 

The approach of running the miner on the tree produced by the parser is an attractive one since 
it lets the text miner benefit from the knowledge of the document syntax without requiring that 
the parser and miner communicate directly. The same approach can be used on other file types 
with known syntax such as LaTeX files or mail files. For example, consider a mail file consisting of 
responses to a survey. Each message will have all of the normal header information plus a highly 
formatted message body with the survey answers. By parsing the file using the mail syntax and 
then mining the message bodies to structure the survey responses, a user can quickly export the 
survey data along with header data (sender, time it was sent) to a DBMS. 
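The survey scenario can be sketched in Python with the standard email module doing the syntax-aware parse. The message, the "Q1: yes" answer format, and the hand-written line splitting (standing in for the mined theories) are assumptions made for illustration.

```python
import email

# Hypothetical survey response.
raw = (
    "From: alice@example.com\n"
    "Date: Mon, 2 Mar 1998 10:00:00 -0500\n"
    "Subject: Survey\n"
    "\n"
    "Q1: yes\n"
    "Q2: 4\n"
)

# Step 1: parse with knowledge of the mail syntax.
msg = email.message_from_string(raw)
body = msg.get_payload()

# Step 2: structure the body. A miner would infer this rule from
# examples; here it is written by hand as "<name>: <value>" per line.
answers = dict(
    line.split(": ", 1) for line in body.splitlines() if ": " in line
)

# One flat row, ready to be exported to a DBMS.
row = {"sender": msg["From"], "sent": msg["Date"], **answers}
```

The two steps never share state beyond the parsed message, which mirrors the point that the parser and the miner need not communicate directly.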

5 Implementation 

A prototype covering a subset of the components described in Section 3 has been implemented 
in approximately 5000 lines of Java code. Two of the components, the Transaction 
Manager and the Log Manager, have not been implemented at all although the interfaces have 
been designed; thus NoDoSE version 1.0 does not support undo. Also, the current version 
contains only one Reporting component, which writes the extracted data in a generic format similar to 
OEM. It also outputs an ODL schema for the data if it is not semistructured. 

We have run NoDoSE on many different files, including simulator output, mail files, C source 
code, OCRed documents, and many web pages. The results are difficult to quantify, although 
overall we have been pleased at the wide range of documents NoDoSE can extract data from. Part 
of this success, however, is the result of learning how to work around the quirks of the system. For 
example, the miner is currently very sensitive to where the user chooses the boundaries between 
list elements and record fields (i.e. whether to include the final carriage return in the selected text). 

Also, the dependence of many of the theories on constant string markers causes problems. For 
instance, the records of type OneParam from the simulation output example (Figure 2) look like 
" 5 misDL - (avg) 9.295315E-01 - (std) 2.559869E-01 - (num) 2455" where the first value, 5 in this 
case, is the simulator node number that the measured variable is in. This value is redundant and is 
thus not part of the record type we defined for OneParam. The consistent pre-gap marker for the 
variable name (misDL in this case) is two spaces since OneParam records from other nodes start 
with their own node number, i.e., " 4". Unfortunately, identifying the beginning of the variable 
name by two spaces will fail since the node number, which precedes the variable name in the text, is 
also preceded by two spaces, and will thus be falsely identified as the variable name. This problem 
can be avoided by including the node number in the record even though it is redundant. Doing so 
"eats" the node number so that two spaces serves as an adequate pre-gap marker for the variable 
name field. To obviate the need for such workarounds in the future we are developing more 
flexible markers based on regular expressions. 
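The difference between a constant marker and a regular-expression marker can be shown with a short Python sketch. The exact spacing of the record (two spaces before the node number and two before the variable name) is an assumption reconstructed from the description, and the pattern is only one possible choice.

```python
import re

# Assumed layout of a OneParam record.
record = "  5  misDL - (avg) 9.295315E-01 - (std) 2.559869E-01 - (num) 2455"

# Constant pre-gap marker: whatever follows the first two spaces.
start = record.find("  ") + 2
end = record.find(" ", start)
by_constant = record[start:end]

# Regular-expression marker: skip blanks, digits, and blanks, then
# capture the variable name.
by_regex = re.match(r"\s+\d+\s+(\w+)", record).group(1)
```

The constant marker returns the node number, while the regular expression returns the variable name without requiring the node number to be modeled as a field.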

The performance of the system is fine for small files: the mining wait is never more than a 
second or two. We have not been able to deal with large files, however, due to the state of Java. 
One problem is with the TextArea component which is used to display portions of the file on the 
right side of the program window. The component in the toolkit uses the Windows peer, which does 
not support files over 32kb, and we have not been able to find a pure Java component with adequate 
performance for files over 100kb. Luckily, many interesting document types, especially web pages, 
are well within this limit so this restriction has not significantly hampered our research, although 
it has made measuring scalability impossible. Given the commercial push for Java, it is reasonable 
to believe that such problems will be corrected in the near future. 
6 Related Work 



NoDoSE makes two major contributions to the data extraction problem: its open architecture for 
structural mining and the plain text mining component that has been implemented in version 1.0 
of the system. To our knowledge, the former has not been proposed by any other researchers and 
hence we do not discuss it further in this section. Instead, we concentrate on the approaches others 
have taken to the latter problem: mining structure and extracting data from documents. 

The three efforts that are most closely related to our own are [AK97a, AK97b], [HGMC+97], and 
[KWD97]. The system built by Ashish and Knoblock [AK97a] is closest to NoDoSE in its approach: 
to infer the structure of a document by combining automatic analysis with user input. Their system 
is designed for web pages only: it uses font size information, HTML tags, and indentation to guess 
a page's structure. A user can then correct the guesses by instructing the system to ignore certain 
keywords and by identifying new keywords that the system missed. The advantage of this system 
is that certain types of pages can be parsed with very little user input since the system leverages 
its knowledge about HTML syntax and about how characteristics like font size are used to indicate 
nesting. The major disadvantage of the system is that because it depends on HTML tags, it is 
not useful for any other type of document. Also, it deals with only single instances of documents 
so it is unclear that it can be used in cases where no single instance of a document type has all of 
the features of the type. 

Kushmerick, Weld, and Doorenbos describe a system [KWD97] that automatically extracts 
data from web pages although it will also work with plain text files. The extracted data must be 
representable as a set of tuples; no deep structure can be inferred. The advantage of the system 
is that no user interaction is required — the system infers the grammar of a document through a 
machine learning algorithm applied to many instances of the document type. The algorithm must 
be provided with domain knowledge, however, in the form of oracles that can identify interesting 
types of fields within a document. Further, if the algorithm fails, there is no information the user can 
provide to help it. The authors report a success rate of 48% on Internet information resources which 
is impressive for a fully automatic algorithm but not adequate for most applications. Still, their 
work contains interesting ideas for automatic parsing and their notion of corroborating recognizers 
is similar to the way we evaluate theories against user input. 

The final system we discuss, that of Hammer et al. [HGMC+97], is unlike the others in that 
it is fully manual: the user must code a wrapper for their document type using a toolkit. The 
toolkit provides many constructs, especially for HTML processing, that make it easier than writing 
a parser directly in Lex and Yacc [Joh75]. It also provides the most control over the output format 
of the extracted data as well as the best support for semi-structured data. The obvious disadvantage 
is that the user must be able to analyze their documents and then code their wrapper which limits 
the usefulness of this approach as a rapid data integration tool. We are considering it, however, as 
one of the output formats of NoDoSE to give users more control over the output format of their 
data. 

For comparison, the salient features of the three related projects described above, as well as 
those of NoDoSE, are summarized in Table 3. In comparison to the other systems, NoDoSE has 
two primary advantages: 

1. It is the only system that can infer the structure of text files and has support for HTML 
documents. 

2. It is the only system that can serve as a test bed for structure extraction experiments since 
the mining components are well separated from the rest of the system. 

System             Grammar         Nested     Semi-structured  Text       Support   Open
                   Generation      structure  data             documents  for HTML  Architecture
-----------------  --------------  ---------  ---------------  ---------  --------  ------------
Ashish & Knoblock  Semi-automatic  Yes        Somewhat         No         Yes       No
Hammer et al.      Manual          Yes        Yes              Yes        Yes       N/A
Kushmerick, Weld   Automatic       No         No               Yes        No        No
& Doorenbos
NoDoSE             Semi-automatic  Yes        Somewhat         Yes        Yes       Yes

Table 3: Comparison of different data extraction tools. 

We note that except for the system of Hammer et al. (which does not have a mining component), 
there is no reason that the other mining algorithms could not be integrated as mining components in 
NoDoSE. This would yield improved handling of HTML documents, or of portions of documents 
that contain HTML, while retaining the plain text capabilities. 

Finally, we mention two additional studies that are related to this work. First, in [DEW97] the 
authors built a system, ShopBot, for the automatic extraction of product and pricing information 
from on-line shopping web sites. The system performs reasonably well since it is able to leverage its 
domain knowledge about shopping but is not applicable to other domains and it was not considered 
in the above comparison. Second, in [AJ97] the authors describe a fully manual GUI-based tool 
for converting structured files from one format to another. Unfortunately, not enough information 
is provided to compare it to the studies described above. 

7 Conclusions 

Given the amount of interesting data that is in HTML pages or text files rather than in database 
systems, users have a strong need for a tool to extract data from such sources. This paper described 
a tool, NoDoSE, designed explicitly for these needs. NoDoSE serves two purposes. First, it provides 
a general architecture for the exploration of the data extraction problem, allowing other researchers 
to plug in their own mining algorithms, user interfaces, or report generators, without having to 
build the entire framework themselves. Second, it contains a component that is capable of inferring 
the structure of a useful class of text files, allowing data to be quickly extracted without coding. 

The results of using NoDoSE on the type of documents we were originally targeting, simulation 
output and web pages, are promising. We have also noticed that many documents with complex 
parsing rules, such as files of C code, that should be beyond NoDoSE's reach are not, due to stylistic 
conventions like indentation and standard comment blocks. On the other hand, we have occasionally 
been surprised to find very simple and very regular looking documents that NoDoSE cannot handle. 
In these cases the failure is due to the dependence of many of the parsing theories on constant markers 
that delimit list elements or record fields. We are currently developing an alternate approach 
based on regular expressions. We are also finishing [illegible in source] 
and will release NoDoSE over the web soon thereafter. 
Abstract 

To simplify the task of obtaining information from the 
vast number of information sources that are available on the 
World Wide Web (WWW), we are building information medi- 
ators for extracting and integrating data from multiple Web 
sources. In a mediator based approach, wrappers are built 
around individual information sources to translate between 
the mediator query language and the individual sources. 
We present an approach for semi-automatically generating 
wrappers for structured internet sources. The key idea is 
to exploit formatting information in Web pages to hypothe- 
size the underlying structure of a page. From this structure 
the system generates a wrapper that facilitates querying of 
a source and possibly integrating it with other sources. We 
demonstrate the ease with which we are able to build wrappers 
for a number of Web sources using our implemented 
wrapper generation toolkit. 



1. Introduction 

We are building information agents or mediators to 
gather and integrate information from multiple World Wide 
Web sources. The mediator [3, 18] approach has been 
used to integrate information from distributed heterogeneous 
database systems, where a mediator insulates the user 
from problems caused by different locations, query languages, 
and protocols of the different sources. We are extending 
the mediator approach to integrate information from 
multiple Web sources. Our approach is to take several related 
Web sources in a particular domain of interest (e.g., 
finance, government, or real-estate) and provide integrated 
access to multiple Web sources through a mediator. 

*This work is supported in part by the University of Southern California 
Integrated Media Systems Center (IMSC) - a National Science Foundation 
Engineering Research Center, by the Rome Laboratory of the Air Force 
Systems Command and the Defense Advanced Research Projects Agency 
(DARPA) under contract number F30602-94-C-0210, by the National Science 
Foundation under grant number IRI-9313993, and by the DARPA Fort 
Huachuca Contract DABT63-96-C-0066. The views and conclusions contained 
in this paper are the authors' and should not be interpreted as representing 
the official opinion or policy of DARPA, RL, NSF or any person or 
agency connected with them. 

For example, we can use a mediator to provide integrated 
access to multiple Web sources that provide information 
on countries in the world. An excellent Web source is the 
CIA World Fact Book (footnote 1), which provides information on the 
geography, economy, government, etc., of every country. 
Other interesting sources include the Yahoo listing of 
countries by region from where we can obtain information 
such as what countries are in Europe, the Pacific Rim, etc. 
Another interesting source is the on-line listing of country 
corruption rankings. A user could query a mediator that provides 
access to the above sources to answer queries such as 
"Find the Economic Overview, Telephone 
System and Corruption Rankings of all 
countries in the Pacific Rim." The mediator 
would determine what sources can be used to answer the 
query, retrieve information from these sources, and present 
the integrated result to the user. There are several other 
research projects that are working on integrating Web-based 
sources. These projects include InfoSleuth [4], the 
OBSERVER project [15], the Information Manifold [11], 
and the Internet Softbot [7]. 

1 http://www.odci.gov/cia/publications/nsolo/wfb-all.htm 

An essential component in a mediator architecture is a 
wrapper around each individual data source (see Figure 1), 
which accepts queries from the mediator, translates the 
query into the appropriate query for the individual source, 
performs any additional processing if necessary, and returns 
the results to the mediator. To make the integration of Web 
sources using the mediator approach feasible, wrappers are 
needed for all of the Web sources to be accessed. Wrappers 
for Web sources would accept a query from the mediator, 
fetch the relevant pages from that source, extract the requested 
information from the retrieved pages and return the 
results to the mediator. Essentially the wrappers make the 
Web sources look like databases that can be queried through 
the mediator's query language. The basic techniques applied 
in database integration using mediators can then be applied 
to Web sources integration. It is however impractical 
to construct wrappers for Web sources by hand for a number 
of reasons: 

The number of information sources of interest is very 
large, even within a particular domain. 

Newer sources of interest are added quite frequently on 
the Web. 

The format of existing sources often changes. 

We report on the development of an implemented wrapper 
generation toolkit that provides a semi-automatic, interactive 
wrapper generation facility for Web sources. It should 
be noted that building wrappers is just one of the challenges 
in building the kinds of information mediators for the Web 
that we envision. Problems lie in several other areas such 
as modeling the information sources, resolving semantic 
heterogeneity amongst different sources, query planning to 
gather the requested information from different sites, and 
intelligently caching retrieved data, to name a few. The focus 
of this paper is solely on wrapper generation. 

The rest of this paper is organized as follows. Section 2 
provides an overview of the different kinds of information 
sources on the Web. Section 3 describes how we 
semi-automatically generate wrappers. Section 4 presents 
experimental results to demonstrate the effectiveness of our 
techniques for wrapper generation. Section 5 describes related 
work. Section 6 presents future directions and conclusions. 

2. Types of Web Information Sources 

We categorize the types of pages from Web sources into 
three classes: multiple-instance sources, single-instance 
sources, and loosely-structured sources. Certain sources 
provide information in multiple pages, all conforming to 
the same format. We call such sources multiple-instance 
sources. Consider a source such as the CIA World Fact 
Book. This source provides information on each of the 
267 countries in the world, with information for each 
country presented on a separate page for that country. The 
information on each page is presented in a semi-structured 
manner since each page can be clearly sub-divided into 
distinct sections with headings labeling the beginning 
of each section. Also, the information on all pages is 
presented in exactly the same format. A page for one 
country is shown in Figure 2. There are clearly identifiable 
sections such as Geography, Area, Land boundaries, etc., 
on each page. For each individual page we would like the 
wrapper to handle queries about one or more sections in the 
page. For example, "Find the Land boundaries 
and Area of France." This wrapper will in turn 
allow a mediator to handle aggregate queries (spanning 
multiple countries) such as "Find the National 
Product, and Defense Expenditures of 
all countries in Europe." 

There are a number of sources on the Web that fall in 
the multiple instance category, such as the National Science 
Foundation (NSF) Grants database, 2 the General Services 
Administration (GSA) On-line Shopping database, 3 the 
NSF Funding Opportunities database, Genetics databases 
such as OMIM, 4 or the Air Force Fact Sheets, 5 to name a 
few. It might be argued that for this category of sources, the 
information that is put on-line often comes from a database 
itself. Thus we should query the databases directly. Unfor- 
tunately for most of these sources, access to the underlying 
databases is simply not permitted or might be allowed only 
with a license fee to query the database. However the in- 
formation put on-line is readily and freely accessible, which 
makes a case for building wrappers in order to query these 
sources. 

Another category is that of semi-structured single-instance
pages. There are numerous sources on the Web that
contain useful information in a semi-structured form, but
on a single page. To name a few, consider the CoopIS
96 proceedings page, list of AAAI Fellows or the Yahoo
list of countries by region. The CoopIS 96 proceedings
page 6 is organized into clearly identifiable sections,
with a heading for each section (such as Classification
and Ontologies, or Data Integration etc.). Each
section starts with the Chair of that section followed by
papers presented in that section. From such a page
we would like a wrapper to be able to answer queries
such as "Find the names of all people who
chaired a session in CoopIS 96" and expect
the wrapper to extract and return the list of chairs i.e.,
"Witold Litwin, James Geller, Klemens Bohm 

Finally there are pages that are more loosely structured, 
such as a personal homepage. For such cases, i.e., in the ab- 
sence of clearly identifiable sections with headings, the ex- 
traction task becomes much harder. Also the use of fancy 



2 http://cos.gdb.org/best/fedfund/nsf-intro.html

3 http://www.fss.gsa.gov/

4 http://www3.ncbi.nlm.nih.gov/Omim

5 http://www.af.mil/pa/indexpages/fs_index.html

6 http://sunsite.informatik.rwth-aachen.de/dblp/db/
Figure 1. Role of wrappers in providing integrated access to multiple information sources



France

Geography

Location: Western Europe, bordering the Bay of Biscay and English Channel, between Belgium and Spain
southeast of the UK; bordering the Mediterranean Sea, between Italy and Spain

Map references: Europe

Area:
total area: 547,030 sq km
land area: 545,630 sq km
comparative area: slightly more than twice the size of Colorado
note: includes Corsica and the rest of metropolitan France, but excludes the overseas administrative
divisions

Land boundaries: total 2,892.4 km, Andorra 60 km, Belgium 620 km, Germany 451 km, Italy 488 km,
Luxembourg 73 km, Monaco 4.4 km, Spain 623 km, Switzerland 573 km

Coastline: 3,427 km (mainland 2783 km, Corsica 644 km)

Figure 2. Snapshot of a page from the CIA World Fact Book
graphics or images for information presentation makes the
task of building a wrapper for that source more difficult.

In this paper we focus on semi-automatically building
wrappers for semi-structured sources, in both the
multiple-instance and single-instance categories. For loosely
structured sources or sources with complicated graphics we have
to build wrappers manually. However it is the large number
of sources in the semi-structured category, and the wealth of
information that can be obtained from them that has motivated
us to automate the task of wrapper generation for such
sources.

3. Approach to Automated Wrapper Generation

This section describes our approach to generating wrappers
for Web sources. We have attempted to automate the
process of building wrappers as much as possible. The
following steps are involved in generating a wrapper for a new
Web source:

• Structuring the source: This involves identifying sec- 
tions and sub-sections of interest on a page. 

• Building a parser for the source pages: After struc- 
turing the source we build a parser that can extract se- 
lected sections from a page from the source. 

• Adding communication capabilities between the web 
sources, wrappers, and mediators: The mediator that 
integrates several sources must be able to communicate 
with the wrappers for these sources. Also, the wrappers 
must communicate with Web sources to retrieve data in 
order to answer queries. 

We describe these steps in detail below. 

3.1. Structuring the Source 

In specifying the structure of a page on the Web, two 
things need to be clearly identified: 

1. Tokens of interest on a page. By tokens we mean words
or phrases that indicate the heading of a section, such
as Geography, Economy, or Total Area on the
CIA World Fact Book page. A heading indicates the
beginning of a new section; thus identifying headings
identifies the sections on a page.

2. The nesting hierarchy within sections. Once a page
has been decomposed into various sections, we
have to identify the nesting structure of the sections.
For instance a CIA World Fact Book page is
comprised of the sections Geography, People,
Economy, Government and Transportation.
The Geography section in turn is broken down into
the sections Area, Land boundaries, etc., while
Area contains land area, total area, etc.

The structuring task can be done automatically or with 
minimal user interaction. The key idea here is that a pro- 
gram analyses the HTML and other formatting information 
in a sample page from the source and guesses the interest- 
ing tokens on that page. The system also uses the formatting 
information to guess the nesting structure of the page. The 
heuristics used for identifying important tokens on a page 
and the algorithm used to organize sections into a nested
hierarchy are an important contribution of this work. We
describe them in more detail below.

3.1.1 Identifying Tokens 

Tokens identifying the beginning of a section are often presented
in bold font in HTML. They may also be written entirely
in upper case words, or may end with a colon. We can
generate a lexical analyzer that searches a page for such tokens
using LEX [14], a lexical analyzer generator. In Table 1
we list the regular expressions given as specifications
to LEX to identify tokens indicating headings on a page.
From these specifications we generate a lexical analyzer that
identifies words or phrases conforming to the regular expressions.
When structuring any page, the system is able to
identify headings that are formatted in any of the ways listed
in Table 1. For instance, given a page from the CIA World
Fact Book the system is able to identify the tokens of interest
such as Geography, Land boundaries, Area etc.
Since each token marks the beginning of a section on a page,
at the end of the above tokenizing step all the different sections
on a page have been identified.
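
The tokenizing step can be illustrated with a minimal sketch. It is written in Python rather than LEX, and the patterns are simplified approximations of the Table 1 heuristics, not the specifications used by the system. The names HEADING_PATTERNS and find_headings are hypothetical, and the "words ending in colon" heuristic is left out to keep the example short.

```python
import re

# Simplified stand-ins for the Table 1 heuristics (assumed, not the paper's
# LEX rules). Each pattern captures the heading text in group 1.
HEADING_PATTERNS = [
    ("bold", r"<b>\s*([^<\n]+?)\s*</b>"),
    ("header", r"<h[0-6][^>]*>\s*([^<\n]+?)\s*</h[0-6]>"),
    ("strong", r"<strong>\s*([^<\n]+?)\s*</strong>"),
    ("italic", r"<i>\s*([^<\n]+?)\s*</i>"),
]


def find_headings(html):
    """Return (kind, heading_text) pairs in the order they appear on the page."""
    found = []
    for kind, pattern in HEADING_PATTERNS:
        for match in re.finditer(pattern, html, flags=re.IGNORECASE):
            found.append((match.start(), kind, match.group(1)))
    found.sort()  # page order, by character offset
    return [(kind, text) for _, kind, text in found]


sample_page = (
    "<h3>Geography</h3>\n"
    "<b>Location:</b> Western Europe\n"
    "<b>Area:</b>\n"
    "<i>total area:</i> 547,030 sq km\n"
)
headings = find_headings(sample_page)
```

Matching is case-insensitive, so one pattern covers the three Strong-tag rows of Table 1; a real LEX specification would list them separately as the table does.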

3.1.2 Determining the Hierarchical Structure 

The next step is to obtain the nesting hierarchy of sections
on the page, i.e., what sections comprise the page at the top
level, what sub-sections comprise other sections in the page,
etc. As with the tokenizing step, the nesting hierarchy can be
obtained in a semi-automatic fashion for a large number of
pages. We have developed an algorithm that, given a page
with all sections and headings correctly identified, outputs a
hierarchy of sections. The following two simple heuristics
are used:

1. Font Size - The font of the heading of a sub-section is
generally smaller than that of its parent section.

2. Indentation - Indentation spaces (which can be detected
from raw text or HTML tags) are often used to indicate
that one section is a subsection of another.
Regular expression                        Description                     Example heading

[...]                                     Headings in bold tags           <b><a href="">Chair</b>
"<"[hH][0-6]">"[^\n]+"</"[hH][0-6]">"     Headings with font size         <h3>Geography</h3>
"<STRONG>"[^\n]+"</STRONG>"               Headings in Strong tags         <STRONG>Area</STRONG>
"<Strong>"[^\n]+"</Strong>"               Strong tags in different case   <Strong>Population</Strong>
"<strong>"[^\n]+"</strong>"               Strong tags in lower case       <strong>Deadlines</strong>
[A-Za-z0-9\-_ ]+[:]                       Words ending in colon           IRS NUMBER:
"<"[iI][^<>]*">"[^\n]+"</"[iI]">"         Italicized words                <i>total area:</i>

Table 1. Heuristics for identifying tokens when structuring a page



current_node = make_new_tree();  /* returns a node that is the root of a new tree */
while (more_headings) {
    new_node = construct_node(heading);  /* makes a new node for the new section */
    while ((size_of(current_node) <= size_of(new_node)) or
           (indentation_of(current_node) >= indentation_of(new_node))) {
        /* search for the immediate parent section of the new section */
        current_node = parent_of(current_node);
    }
    make_rightmost_child(current_node, new_node);
    /* make the new section the rightmost child of its immediate parent */
    current_node = new_node;
}
generate_grammar();
/* a procedure that from the tree constructed above, for each node N with ordered children
   C1, C2, ..., Cm outputs a grammar rule of the form N -> C1 C2 ... Cm */

Figure 3. Algorithm to obtain nesting hierarchy



CIApage -> Geography People Government Economy Transportation
Geography -> Location Map_references Area Land_boundaries Coastline ..
Area -> total_area land_area comparative_area

Figure 4. Nesting Hierarchy for CIA Page






Using the procedure shown in Figure 3, the system out- 
puts a grammar describing the nesting hierarchy of sections 
in a page. This procedure first builds a tree that reflects the 
nesting hierarchy of sections. We construct a node for each 
heading that identifies a new section, and make this node a 
child of the section that should be its immediate parent based 
on the font size and indentation of the section headings. The 
children of each node are ordered, i.e., they appear in the 
same order in which the corresponding sections appear on 
the page. When all nodes for all sections have been placed in 
the tree, the procedure outputs grammar rules for each node 
in the tree, essentially stating that the section at each node 
has as sub-sections all its immediate children in the tree (and 
in the order in which they appear in the tree). For instance, 
for pages from the CIA World Fact Book the grammar output 
is shown in Figure 4. 
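
The procedure of Figure 3 can be sketched as runnable Python. This is an illustration under stated assumptions, not the system's code: each heading is assumed to arrive as a (name, font_size, indentation) triple, the numeric values below are invented for the example, and the root node is given an infinite font size and an indentation of -1 so that the search for a parent always stops there. Node, build_hierarchy and grammar_rules are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    size: float   # font size of the heading (larger means higher level)
    indent: int   # indentation of the heading
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: list = field(default_factory=list)


def build_hierarchy(page_name, headings):
    """Build the section tree from headings given in page order."""
    # Assumed sentinel values: no heading is larger or less indented than the root.
    root = Node(page_name, size=float("inf"), indent=-1)
    current = root
    for name, size, indent in headings:
        new = Node(name, size, indent)
        # Climb until a node is found that is both larger and less indented.
        while current.size <= new.size or current.indent >= new.indent:
            current = current.parent
        new.parent = current
        current.children.append(new)  # rightmost child of its immediate parent
        current = new
    return root


def grammar_rules(node):
    """Emit one rule 'N -> C1 C2 ... Cm' per node that has children."""
    rules = []
    if node.children:
        rules.append(f"{node.name} -> {' '.join(c.name for c in node.children)}")
        for child in node.children:
            rules.extend(grammar_rules(child))
    return rules


# Invented sizes and indentations, chosen only to exercise the algorithm.
example_headings = [
    ("Geography", 3, 0),
    ("Location", 2, 1),
    ("Area", 2, 1),
    ("total_area", 1, 2),
    ("land_area", 1, 2),
    ("People", 3, 0),
]
example_rules = grammar_rules(build_hierarchy("CIApage", example_headings))
```

Because the loop condition joins the two tests with "or", a node is accepted as parent only when its heading is strictly larger and strictly less indented than the new heading; pages that signal nesting by only one of the two cues would need the user corrections described next.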

It is possible for the system to make mistakes when try- 
ing to identify the structure of a new page. Based on the 
heuristics listed in Table 1, the system can identify head- 
ings erroneously (that is identify some words or phrases as 
headings when they are not, or fail to identify phrases that 
are headings, but do not conform to any of the regular ex- 
pressions in Table 1). We have provided a facility for the 
user to interactively correct the system's guesses. Through 
a graphical interface the user can highlight tokens that the 
system misses, or delete tokens that the system erroneously 
chooses. The user can similarly correct errors in the system- 
generated grammar that describes the structure of the page. 

3.2. Building a Parser for the Source Pages 

The next step is to generate a parser for pages from the
source. Given a page from the source, such a parser can
extract any selected section(s) from the page. For instance
a parser for pages from the CIA World Fact Book can extract
sections such as Geography.Area (the "." indicates
that Area is a subsection of Geography in the spirit
of complex objects) i.e., the Area sub-section within the
Geography section from the page for any country. Such a
parser can be automatically generated, since all of the grammatical
and lexical information needed to parse the page
is obtained at the structuring step. The compiler generator
YACC [10] and the tool LEX are used for this purpose.
The tokens identified in the structuring step are directly input
as specifications to LEX to generate a lexical analyzer
for a page from the source. For instance the tokens identified
in the CIA World Fact Book page are Geography,
Location, Map references, Area, total area,
etc., and the specifications given to LEX to generate a lexical
analyzer for a page from the CIA World Fact Book are
shown in Figure 5.



"<h3>Geography</h3>"       { return (GEO_HEAD); }
"<b>Location:</b>"         { return (LOC_HEAD); }
"<b>Map references:</b>"   { return (MAP_HEAD); }
"<b>Area:</b>"             { return (AREA_HEAD); }
"<i>total area:</i>"       { return (TOT_HEAD); }
.                          { return (TEXT); }
\n                         { return (TEXT); }

Figure 5. LEX Specifications for CIA Page

The tool YACC can generate a parser for a language 
given grammar rules that specify valid sentences in the lan- 
guage. We directly translate the grammar rules describ- 
ing the overall structure of the page, obtained at the end of 
the structuring step, into a YACC specification. The parser 
generated can parse valid "sentences" i.e., pages from the 
source. Figure 6 shows what the rules specified to YACC 
to parse pages from the CIA World Fact Book look like. For 
instance, the first part of the first rule states that a single page
is comprised of the Geography section, People section, etc. 
The second part of the rule shows YACC code for storing 
and manipulating parsed data. With these specifications we 
use LEX and YACC to generate a parser for pages from the 
source. 
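
To make the role of the generated parser concrete, here is a small hand-written recursive-descent sketch in Python. It is not YACC output and not the system's parser; it only assumes that the lexer delivers (section_name, text) pairs in page order and that the grammar is available as a dictionary. The names GRAMMAR, TOKENS, parse_page, extract and section_text are hypothetical, and the People text is a placeholder.

```python
# Assumed grammar, in the spirit of Figure 4 but shortened.
GRAMMAR = {
    "CIApage": ["Geography", "People"],
    "Geography": ["Location", "Area"],
    "Area": ["total_area", "land_area"],
}

# Assumed lexer output: each heading token with the text that follows it.
TOKENS = [
    ("Geography", ""),
    ("Location", "Western Europe"),
    ("Area", ""),
    ("total_area", "547,030 sq km"),
    ("land_area", "545,630 sq km"),
    ("People", "people text"),
]


def parse_section(name, tokens, pos, grammar):
    """Parse section `name` at tokens[pos]; return (subtree, next position)."""
    heading, text = tokens[pos]
    if heading != name:
        raise ValueError(f"expected {name!r}, found {heading!r}")
    node = {"text": text, "children": {}}
    pos += 1
    for child in grammar.get(name, []):
        if pos < len(tokens) and tokens[pos][0] == child:
            subtree, pos = parse_section(child, tokens, pos, grammar)
            node["children"][child] = subtree
    return node, pos


def parse_page(root, tokens, grammar):
    """Parse a whole page whose top-level sections are grammar[root]."""
    page = {"text": "", "children": {}}
    pos = 0
    for child in grammar[root]:
        if pos < len(tokens) and tokens[pos][0] == child:
            subtree, pos = parse_section(child, tokens, pos, grammar)
            page["children"][child] = subtree
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos][0]!r}")
    return page


def extract(page, path):
    """Follow a dotted path such as 'Geography.Area'."""
    node = page
    for part in path.split("."):
        node = node["children"][part]
    return node


def section_text(node):
    """Concatenate a section's text with that of its sub-sections."""
    parts = [node["text"]] + [section_text(c) for c in node["children"].values()]
    return " ".join(p for p in parts if p)


cia_page = parse_page("CIApage", TOKENS, GRAMMAR)
area_text = section_text(extract(cia_page, "Geography.Area"))
```

The concatenation in section_text plays the role of the strcpy/strcat actions in the YACC rules: the value of a section is built from the values of its parts.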

3.3. Adding Communication Capabilities between 
the Wrapper, Mediator and Web Sources 

Given a query, a wrapper for a Web source should be 
able to fetch the pages containing the requested informa- 
tion from the Web source. Also some mechanism is needed 
for communication between the mediator and the wrapper 
as they are separate processes, possibly running at different 
locations. The following communication functionality thus
needs to be added to the wrapper.

1. Identifying network locations of page(s) needed to answer
a query. For sources with just a single page this is
straightforward i.e., the URL for that page is known to
the wrapper. For sources with multiple pages, a mapping
between a query and the URL of the relevant page
might be required. For instance for the CIA World Fact
Book there is a one to one mapping between the country
name and the URL of the page for that country. This
mapping can be obtained from the index page for the
CIA source. For the GSA database the part number appears
at the end of the URL for that source to point to
the page for that part.

To provide the capability of determining the network
location of the page relevant to a query, the user specifies
a mapping function which takes necessary arguments
from a query (e.g., country name from a query on
the CIA source) and constructs a URL pointing to the
page to be fetched.

CIApage : Geographysection Peoplesection Governmentsection Economysection Transportsection
    { strcpy($$,$1); strcat($$,$2); strcat($$,$3); strcat($$,$4); }

Geographysection : Locationsection Maprefsection Areasection Landboundariessection ...
    { strcpy($$,$1); ... }

Areasection : totalareasection landareasection compareasection
    { strcpy($$,$1); strcat($$,$2); strcat($$,$3); }

Locationsection : Locationheading Text
    { strcpy($$,$1); strcat($$,$2); }

Maprefsection : Maprefheading Text
    { strcpy($$,$1); strcat($$,$2); }

Locationheading : LOC_HEAD
    { strcpy($$,yytext); }

Maprefheading : MAP_HEAD
    { strcpy($$,yytext); }

Figure 6. YACC specifications for CIA page

2. Capability to retrieve data over the network. Currently
we are using PERL scripts for the purpose of making
HTTP connections to the Web information sources and
retrieving data from them.

3. Communication between the mediator and wrapper.
We are using the agent communication language
KQML [8] for the purpose of providing interprocess
communication between the mediator and a wrapper.

Adding the above functionality is the final step in gener- 
ating a wrapper for a new source. The parser for pages from 
a Web source plus the above communication functionality 
results in a complete wrapper for that Web source. 
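
The first two capabilities can be sketched as follows. The sketch uses Python's standard library instead of the PERL scripts mentioned above, and every URL and file name in it is a placeholder invented for the example; the real mapping would come from the source's index page. BASE_URL, COUNTRY_PAGES, country_url, part_url and fetch are hypothetical names.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Placeholder values, assumed for illustration only.
BASE_URL = "http://example.org/factbook/"
COUNTRY_PAGES = {"France": "fr.html", "Germany": "gm.html"}


def country_url(country):
    """Mapping function: one-to-one lookup from country name to page URL."""
    try:
        page = COUNTRY_PAGES[country]
    except KeyError:
        raise LookupError(f"no page known for {country!r}") from None
    return BASE_URL + page


def part_url(part_number):
    """Mapping function for a source where the key is appended to the URL."""
    return "http://example.org/parts/" + quote(str(part_number))


def fetch(url, timeout=10.0):
    """Retrieve the page text over HTTP (not called in this sketch)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


france_url = country_url("France")
```

A wrapper would call fetch(country_url(name)) and hand the result to the parser; communication with the mediator (KQML in the paper) is outside the scope of this sketch.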

4. Results 

We have applied the wrapper generator to the task of
generating wrappers for a variety of internet sources. We
present experimental results to provide an idea of the effort
required to generate a wrapper for a new source. The step
that is most difficult to automate when generating a wrapper
is the first step where we obtain the structure of a page or
sample pages from the source. Generating the parser is then
done automatically and defining a mapping function from
queries to URLs of relevant pages for sources with multiple pages
requires comparatively little effort on part of the user. It is
thus the structuring step that dominates the time and effort
needed to build a wrapper for a new source.

We used the wrapper generator to build wrappers for sev- 
eral internet sources and to evaluate the effectiveness of the 
heuristics we use for structuring a new page automatically. 
To provide a quantitative measure of the effectiveness of the 
heuristics, we define what we call correction steps. During 
structuring a page, each time the user has to manually cor- 
rect a token (i.e., add or delete a token) or correct a rule in 
the grammar describing the nesting hierarchy of sections, it 
is counted as one correction step. The total number of cor- 
rection steps made before the page is completely structured 
provides an estimate of how hard it is to automatically struc- 
ture that page. We also provide the time taken to generate the 
wrapper for each source. This would of course vary from 
user to user. Nevertheless, the results give a sense for ap- 
proximately how long it might take to generate wrappers us- 
ing this toolkit. 

Table 2 demonstrates the ease with which we built
wrappers for a dozen internet sites, from both the
multiple-instance and single-instance categories. We provide the
number of correction steps to structure a sample page from
each source as well as the total time (in minutes) taken to
build a wrapper for that source. The results are extremely
encouraging. Several sources require almost no or very few
correction steps to structure them, thus showing that the
heuristics for structuring pages are quite successful. Also,
it takes only a few minutes to generate a wrapper for most
Multiple-instance sources                  Correction steps   Time in min

1. The CIA World Fact Book
2. GSA On-line Shopping database
3. The NSF database
4. The OMIM Genetics database
5. Hoover Company Profiles
6. The Internet Movie Database
7. The Air Force Library Fact Sheets

Single-instance sources                    Correction steps   Time in min

1. CoopIS 96 Proceedings page
2. AAAI-97 conference homepage
3. List of US Universities by state
4. List of AAAI Fellows by year
5. Computer Science Job Listings
6. SIGMOD Record page
7. US Air Force Organization page

Table 2. Experimental Results showing the effort and time to build wrappers for different sources
sources. Such a toolkit is thus extremely useful as it
provides a convenient and quick way to generate wrappers
for new sources of information on the Web and then
integrate them via a mediator. We successfully integrated
sources in the countries information domain, such
as the CIA World Fact Book, Yahoo listings of countries by
region, etc. using the Ariadne system, which is a descendant
of the SIMS [3] information mediator that addresses
the problem of integrating Web sources. We were then
able to pose queries to Ariadne such as "Find the
External debt and Defense expenditures
of countries in the EEC." The answer
given by the mediator is shown in Figure 7.

5. Related work

Generating wrappers for databases and Web sources, and
providing database-like querying for semi-structured data
are research areas that have received considerable attention.
Hammer et al. [9] developed a template-based approach
to generating wrappers for Web sources and other
types of legacy systems. With their approach, the user
provides actions for the system to execute when a query
matches a certain template or format. This approach provides
a way of rapidly constructing wrappers by example,
but could require a large number of examples to specify a
source.

Doorenbos et al. [6] developed an Internet comparison
shopping agent that can automatically build wrappers for
Web sites. Since they focus on pages that contain items for
sale, they make much stronger assumptions about the type
of information they are looking for and use that information
to hypothesize the underlying structure. Their wrapper language
is not very expressive and the system is quite limited
in terms of the types of pages for which it can generate wrappers.

Kushmerick et al. [13] also developed an approach to au- 
tomatically generating wrappers. The focus of their work is 
very similar to ours i.e., building wrappers for Web sources 
to be integrated by a software agent. However, they fol- 
low a very different approach that uses inductive learning 
techniques to build a program that extracts data from a Web 
page. They assume that they are given a set of recognizers 
that can then be used to generate examples for the learning 
system. The advantage of their approach is that the result- 
ing wrappers will be more robust to inconsistencies across 
multiple-instance pages. On the other hand, their approach 
could not be used to generate wrappers for more complex 
pages, such as the CIA World Fact Book, without first build- 
ing recognizers for each of the fields of those pages. 

There is also a variety of work that addresses issues in
directly querying semi-structured data, particularly data obtained
from Web sources, in a database-like fashion [1, 5, 2,
12, 16]. These efforts are concerned with issues such as the
development of data models and query languages for
semi-structured data, defining formal semantics for these query
languages, and efficiently implementing these languages.
The focus of our work is on the generation of wrappers that
provide a uniform interface to a variety of semi-structured
data, as opposed to efforts that support direct querying of the
data.
Figure 7. Answer to query involving multiple Web sources



6. Future work and conclusions 

We have presented the ideas and results of our approach 
for automatically generating wrappers for Web sources. We 
have clearly separated the tasks in building wrappers that are 
specific to a particular Web source such as structuring the 
source, and tasks which are repetitive for any source (and 
can thus be done by the system) such as generating a parser 
from the structure of a page and adding communication ca- 
pabilities. The main contribution of our work is automating 
the structuring step, through the use of heuristics for deter- 
mining the structure by exploiting formatting information in 
pages from the source. Our ideas appear to be effective for 
many types of semi-structured sources. However we need
more advanced wrappers to be able to broaden the scope of
sources we can generate wrappers for and also to be able to
handle finer grained queries. Currently we are working on
enhancing the wrappers with the following capabilities:

• Learning new tokens by examples: It is possible that
while structuring a page, the system is unable to identify
tokens on the page if they do not conform to any of
the regular expressions in Table 1. We are working on
adding capabilities to the system to quickly learn the
structure of a new kind of heading from a few user examples.
We are applying techniques for inducing Hidden
Markov models (HMMs), describing the tokens,
from corpora of positive examples. The basic idea, described
in [17], is to start with an HMM accepting only
the initial tokens marked by the user. Then, states in the
HMM are merged to yield a generalized model that can
be used to identify the remaining tokens in the page.
The system can then identify the remaining tokens in
the page automatically.

• Handling Tables: A challenging problem is to automat-
ically build parsers for information in tables. The hard 
problem here is to determine exactly what is contained 
in the different rows and columns of the table and then 
build a parser to extract information from it. 



• Handling finer grained queries: Consider the Land
boundaries section on a CIA World Fact Book page.
Currently the wrapper cannot handle queries such
as "Find the names of all countries
bordering France" as the parser does not have
enough knowledge of the structure within the Land
boundaries field to decompose that field into pairs
of countries and corresponding border lengths. We
are currently investigating using machine learning
techniques where the user gives a few examples
highlighting items of interest within a field and the
system is eventually able to learn the structure within
that field.
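
To show what decomposing the Land boundaries field would produce, the sketch below uses a hand-written regular expression in Python. This is explicitly not the machine learning approach under investigation; it only illustrates the target output, pairs of country and border length, on the field text from Figure 2. PAIR and parse_land_boundaries are hypothetical names.

```python
import re

# Hand-written pattern (an assumption for illustration): a name made of
# letters, spaces, periods, apostrophes or hyphens, then a number, then "km".
PAIR = re.compile(r"([A-Za-z][A-Za-z .'-]*?)\s+([\d,]+(?:\.\d+)?)\s*km")


def parse_land_boundaries(field_text):
    """Split a Land boundaries field into {name: length in km}."""
    result = {}
    for name, length in PAIR.findall(field_text):
        result[name.strip()] = float(length.replace(",", ""))
    return result


land_boundaries = (
    "total 2,892.4 km, Andorra 60 km, Belgium 620 km, Germany 451 km, "
    "Italy 488 km, Luxembourg 73 km, Monaco 4.4 km, Spain 623 km, "
    "Switzerland 573 km"
)
borders = parse_land_boundaries(land_boundaries)
# The "total" entry is a summary, not a neighbouring country.
neighbours = sorted(name for name in borders if name != "total")
```

With the field decomposed this way, the query about countries bordering France reduces to listing the names in neighbours.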

We are using the wrapper generator system to generate
wrappers for semi-structured sources and are working
on making the system more advanced and capable of handling
more kinds of Web sources. Generation of wrappers is
very useful in meeting our broader goal of integrating Web
sources via a mediator, by which we hope to simplify the
task of obtaining information from the already numerous
and ever growing information sources on the Web.
1 Introduction

The popularity of the World-Wide Web (WWW) has made 
it a prime vehicle for disseminating information. The rel- 
evance of database concepts to the problems of managing 
and querying this information has led to a significant body 
of recent research addressing these problems. Even though 
the underlying challenge is the one that has been tradition- 
ally addressed by the database community - how to manage 
large volumes of data - the novel context of the WWW forces 
us to significantly extend previous techniques. The primary 
goal of this survey is to classify the different tasks to which 
database concepts have been applied, and to emphasize the 
technical innovations that were required to do so. 

We do not claim that database technology is the magic bullet 
that will solve all web information management problems; 
other technologies, such as Information Retrieval, Artificial 
Intelligence, and Hypertext/Hypermedia, are likely to be 
just as important. However, surveying all the work going 
on in these areas, and the interactions between them and 
database ideas, would be far beyond our scope. 

We focus on three classes of tasks related to information
management on the WWW. 

Modeling and querying the web: Suppose we view the 
web as a directed graph whose nodes are web pages and 
whose edges are the links between pages. A first task we
consider is that of formulating queries for retrieving certain 
pages on the web. The queries can be based on the content 
of the desired pages and on the link structure connecting the 
pages. The simplest instance of this task, which is provided 
by search engines on the web is to locate pages based on the 
words they contain. A simple generalization of such a query 
is to apply more complex predicates on the contents of a 
page (e.g., find the pages that contain the word "Clinton" 
next to a link to an image). Finally, as an example of a query 
that involves the structure of the pages, consider the query 
asking for all images reachable from the root of the CNN web 
site within 5 links. The last type of queries are especially 
useful when detecting violations of integrity constraints on 
a web site or a collection of web sites. 
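
The structural query mentioned above (all images reachable from a root page within a given number of links) amounts to a bounded breadth-first search over the link graph. The following Python sketch assumes the graph is already available as a dictionary from page to linked pages and that images can be recognized by file extension; both are simplifications, and reachable_within and images_within are hypothetical names.

```python
from collections import deque


def reachable_within(graph, root, max_links):
    """Return {page: distance} for pages at most max_links links from root."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        if dist[page] == max_links:
            continue  # do not follow links beyond the bound
        for target in graph.get(page, ()):
            if target not in dist:
                dist[target] = dist[page] + 1
                queue.append(target)
    return dist


def images_within(graph, root, max_links):
    """Images (recognized by extension, an assumption) within the bound."""
    extensions = (".gif", ".jpg", ".png")
    pages = reachable_within(graph, root, max_links)
    return sorted(p for p in pages if p.lower().endswith(extensions))


# A tiny invented link graph.
site = {
    "/": ["/world", "/logo.gif"],
    "/world": ["/world/europe"],
    "/world/europe": ["/map.jpg"],
}
near_images = images_within(site, "/", 2)
```
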
Information extraction and integration: Certain web
sites can be viewed at a finer granularity level than pages,
as containers of structured data (e.g., sets of tuples, or
sets of objects). For example, the Internet Movie Database
(http://www.imdb.com) can be viewed as a front end
interface to a database about movies. Given the rise in the
number of such sites, there are two tasks we consider. The
first task is to actually extract a structured representation
of the data (e.g., a set of tuples) from the HTML pages
containing them. This task is performed by a set of wrapper
programs, whose creation and maintenance raises several
challenges. Once we view these sites as autonomous
heterogeneous databases, we can address the second task of posing
queries that require the integration of data. The second task
is addressed by mediator (or data integration) systems.

Web site construction and restructuring: A differ- 
ent aspect in which database concepts and technology can 
be applied is that of building, restructuring and managing 
web-sites. In contrast to the previous two classes which ap- 
ply on existing web sites, here we consider the process of 
creating sites. Web sites can be constructed either by start- 
ing with some raw data (stored in databases or structured
files) or by restructuring existing web sites. Performing this
task requires methods for modeling the structure of web sites
and languages for restructuring data to conform to a desired
structure. 

Before we begin, we note that there are several topics con- 
cerning the application of database concepts to the WWW 
which are not covered in this survey, such as caching and 
replication (see [WWW98, GRC97] for recent works), transaction
processing and security in web environments (see
e.g. [Bil98]), performance, availability and scalability issues
for web servers (e.g. [CS98]), or indexing techniques and
crawler technology (e.g. [CGMP98]). Furthermore, this
is not meant to be a survey on existing products even in
the areas on which we do focus. Finally, there are several
tangential areas whose results are applicable to the systems
we discuss, but we do not cover them here. Examples of
such fields include systems for managing document collections
and ranking of documents (e.g., Harvest [BDH+95],
Gloss [GGMT99]) and flexible query answering systems [BT98].
Finally, the field of web/db is a very dynamic one; hence, 
there are undoubtedly some omissions in our coverage, for 
which we apologize in advance. 

The survey is organized as follows. We begin in Section 2 by
discussing the main issues that arise in designing data models
for web/db applications. The following three sections
consider each of the aforementioned tasks. Section 6 concludes
with perspectives and directions for future research.

2 Data Representation for Web/DB Tasks 

Building systems for solving any of the previous tasks re- 
quires that we choose a method for modeling the underlying 
domain. In particular, in these tasks, we need to model the 
web itself, structure of web sites, internal structure of web 
pages, and finally, contents of web sites in finer granularities. 
In this section we discuss the main distinguishing factors of 
the data models used in web applications. 

Graph data models: As noted above, several of the applications
we discuss require modeling the set of web pages
and the links between them. These pages can either be on 
several sites or within a single site. Hence, a natural way 
to model this data is based on a labeled graph data model. 
Specifically, in this model, nodes represent web pages (or 
internal components of web pages), and arcs represent links 
between pages. The labels on the arcs can be viewed as at- 
tribute names. Along with the labeled graph model, several 
query languages have been developed. One central feature 
that is common to these query languages is the ability to 
formulate regular path expression queries over the graph. 
Regular path expressions enable posing navigational queries 
over the graph structure. 
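
The labeled graph model and navigational queries described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph contents, the node names, and the helper functions `follow`, `follow_star` and `eval_path` are hypothetical, and the path language is restricted to concatenation of labels with an optional Kleene star on a single label, which is a small subset of full regular path expressions.

```python
from collections import deque

# A tiny labeled graph: node -> list of (label, target) arcs.
# Node names and labels are hypothetical, chosen for illustration only.
GRAPH = {
    "home": [("dept", "cs"), ("dept", "math")],
    "cs": [("course", "db101"), ("person", "alice")],
    "math": [("course", "alg201")],
    "db101": [("next", "db102")],
    "db102": [],
    "alg201": [],
    "alice": [],
}


def follow(nodes, label):
    """Nodes reachable from `nodes` by exactly one arc carrying `label`."""
    return {t for n in nodes for (l, t) in GRAPH.get(n, []) if l == label}


def follow_star(nodes, label):
    """Nodes reachable by zero or more arcs carrying `label` (Kleene star)."""
    seen = set(nodes)
    queue = deque(nodes)
    while queue:
        n = queue.popleft()
        for t in follow({n}, label):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen


def eval_path(start, steps):
    """Evaluate a path given as a list of (label, starred) steps."""
    current = {start}
    for label, starred in steps:
        if starred:
            current = follow_star(current, label)
        else:
            current = follow(current, label)
    return current


# dept . course . next*  : courses of any department and their successors
result = eval_path("home", [("dept", False), ("course", False), ("next", True)])
```

Representing the graph as an adjacency list keyed by node keeps forward navigation cheap, which matches the observation that web links can generally only be traversed in the forward direction.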

Semistructured data models: The second aspect of modeling
data for web applications is that in many cases the
structure of the data is irregular. Specifically, when model- 
ing the structure of a web site, we don't have a fixed schema 
which is given in advance. When modeling data coming from 
multiple sources, the representation of some attributes (e.g., 
addresses) may differ from source to source. Hence, several 
projects have considered models of semistructured data. The 
initial motivation for this work was the existence and rela- 
tive success of permissive data models such as [TMD92] in 
the scientific community, the need for exchanging objects 
across heterogeneous sources [PGMNV95], and the task of 
managing document collections [MP96].

Broadly speaking, semistructured data refers to data with 
some of the following characteristics: 

• the schema is not given in advance and may be implicit 
in the data, 

• the schema is relatively large (w.r.t. the size of the 
data) and may be changing frequently, 

• the schema is descriptive rather than prescriptive, i.e.,
it describes the current state of the data, but violations 
of the schema are still tolerated, 

• the data is not strongly typed, i.e., for different ob- 
jects, the values of the same attribute may be of dif- 
fering types. 

Models for semistructured data have been based on labeled
directed graphs [Abi97, Bun97] (see footnote 1). In a semistructured
data model, there is no restriction on the set of arcs that emanate
from a given node in a graph, or on the types of the values of
attributes. Because of the characteristics of semistructured
data mentioned above, an additional feature that becomes
important in this context is the ability to query the schema
(i.e., the labels on the arcs in the graph). This feature is
supported in languages for querying semistructured data by
arc variables which get bound to labels on arcs, rather than
nodes in the graph.

1 It should be noted that there is no inherent difficulty in translating
these models into relational or object-oriented terms. In fact,
the languages underlying Description Logics (e.g., Classic [BBMR89])
and FLORID [HLLS97] have some of the features mentioned above,
and are described in non-graph models.
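
To make the idea of arc variables concrete, the following Python sketch stores irregular data as labeled arcs and then queries the labels themselves. It is a sketch under assumptions: the sample arcs and the function names `labels_of`, `arcs_with_value` and `value_types` are hypothetical and do not come from any of the systems surveyed.

```python
# Irregular data as (source, label, target) arcs; all names are hypothetical.
ARCS = [
    ("p1", "title", "Querying the Web"),
    ("p1", "year", 1997),
    ("p1", "author", "Smith"),
    ("p2", "title", "Semistructured Data"),
    ("p2", "year", "1998"),        # same attribute, different type
    ("p2", "address", "Toronto"),  # attribute that p1 does not have
]


def labels_of(node):
    """Bind an arc variable to every label on an arc leaving `node`."""
    return sorted({label for (src, label, _) in ARCS if src == node})


def arcs_with_value(value):
    """Find the (node, label) pairs leading to `value`: a schema query."""
    return [(src, label) for (src, label, tgt) in ARCS if tgt == value]


def value_types(label):
    """Types observed for one attribute across all objects."""
    return {type(tgt).__name__ for (_, lbl, tgt) in ARCS if lbl == label}
```

Here `labels_of("p2")` plays the role of a query whose answer is a set of attribute names rather than data values, and `value_types("year")` shows that the data is not strongly typed.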

In addition to developing models and query languages for 
semistructured data, there has been considerable recent work 
on issues concerning the management of semistructured data, 
such as the extraction of structure from semistructured data 
[NAM98], view maintenance [ZGM98, AMR+98], summarization
of semistructured data [BDPS97, GW97], and reasoning
about semistructured data [CGL98, FFLS98]. Aside
from the relevance of these works to the tasks mentioned in 
this survey, the systems based on these methods will be of 
special importance for the task of managing large volumes 
of XML data [XML98]. 

Other characteristics of web data models: Another 
distinguishing characteristic of models used in web/db ap- 
plications is the presence of web-specific constructs in the 
data representation. For example, some models distinguish 
a unary relation identifying pages and a binary relation for 
links between pages. Furthermore, we may distinguish be- 
tween links within a web site and external links. An im- 
portant reason to distinguish a link relation is that it can 
generally only be traversed in the forward direction. Addi- 
tional second order dimensions along which the data models 
we discuss differ are (1) the ability to model order among 
elements in the database, (2) modeling nested data struc- 
tures, and (3) support for collection types (sets, bags, ar- 
rays). An example of a data model that incorporates explicit 
web-specific constructs (pages and page schemes), nesting, 
and collection types is ADM, the data model of the AraNEUS
project [AMM97b]. We remark that all the models we
mention in this paper represent only static structures. For
example, the work on modeling the structure of web sites does
not consider dynamic web pages created as a result of user
inputs.

An important aspect of languages for querying data in web 
applications is the need to create complex structures as a 
result of a query. For example, the result of a query in 
a web site management system is the graph modeling the 
web site. Hence, a fundamental characteristic of many of 
the languages we discuss in this paper is that their query 
expressions contain a structuring component in addition to 
the traditional data filtering component. 

Table 1 summarizes some of the web query systems covered
in this paper. A more detailed version of this table, available at
http://www.cs.washington.edu/homes/alon/webdb.html, includes
URLs for the systems where available. In subsequent
sections we will illustrate in detail languages for querying 
data represented in these models. 

3 Modeling and Querying the Web 

If the web is viewed as a large, graph-like database, it is 
natural to pose queries that go beyond the basic informa- 
tion retrieval paradigm supported by today's search engines 
and take structure into account; both the internal structure 
of web pages and the external structure of links that inter- 
connect them. In an often-cited paper on the limitations of 
hypertext systems, Halasz [Hal88] says:

    Content search ignores the structure of a hypermedia
    network. In contrast, structure search
    specifically examines the hypermedia structure
    for subnetworks that match a given pattern.

and goes on to give examples where such queries are useful.

System                     | Data Model          | Language Style       | Path Expressions | Graph Creation
WebSQL [MMM97]             | relational          | SQL                  | Yes              | No
W3QS [KS95]                | labeled multigraphs | SQL                  | Yes              | No
WebLog [LSS96]             | relational          | Datalog              | No               | No
Lorel [AQM+97]             | labeled graphs      | OQL                  | Yes              | No
WebOQL [AM98]              | hypertrees          | OQL                  | Yes              | Yes
UnQL [BDHS96]              | labeled graphs      | structural recursion | Yes              | Yes
Strudel [FFK+98, FFLS97]   | labeled graphs      | Datalog              | Yes              | Yes
Araneus (Ulixes) [AMM97b]  | page schemes        | SQL                  | Yes              | Yes
Florid [HLLS97]            | F-logic             | Datalog              | Yes              | No

Table 1: Comparison of query systems

3.1 Structural Information Retrieval 

The first tools developed for querying the web were the well- 
known search engines which are now widely deployed and 
used. These are based on searching indices of words and 
phrases appearing in documents discovered by web "crawlers."
More recently, there have been efforts to overcome the limitations
of this paradigm by exploiting link structure in
queries. For example, [Kle98], [BH98] and [CDRR98] propose
to use the web structure to analyze the many sites
returned by a search engine as relevant to a topic in order
to extract those that are likely to be authoritative sources
on the topic. To support connectivity analysis for this and
other applications (such as efficient implementations of the
query languages described below), the Connectivity Server
[BBH+98] provides fast access to structural information.
Google [BP98], a prototype next-generation web search engine,
makes heavy use of web structure to improve crawling
and indexing performance. Other methods for exploiting
link structure are presented in [PPR96, CK98]. In these
works, structural information is mostly used behind the scenes, 
to improve the answers to purely content-oriented queries. 
Spertus [Spe97] points out many useful applications of queries 
that take link structure into account explicitly. 

3.2 Related query paradigms 

In this section we briefly describe several families of query 
languages that were not developed specifically for querying 
the web. However, since the concepts on which they are 
based are similar in spirit to the web query languages we 
discuss, these languages can also be useful for web applica- 
tions. 

Hypertext/document query languages: A number of 
models and languages for querying structured documents 
and hypertexts were proposed in the pre-web era. For example,
Abiteboul et al. [ACM93] and Christophides et al. [CACS94]
map documents to object oriented database instances by
means of semantic actions attached to a grammar. Then the
database representation can be queried using the query language
of the database. A novel aspect of this approach is the
possibility of querying the structure by means of path variables.
Güting et al. [GZC89] model documents using nested
ordered relations and use a generalization of nested relational
algebra as a query language. Beeri and Kornatzky
[BK90] propose a logic whose formulas specify patterns over
the hypertext graph.

Graph query languages: Work in using graphs to model 
databases, motivated by applications such as software engineering
and computer network management, led to the G,
G+ and GraphLog graph-based languages [CMW87, CMW88,
CM90]. In particular, G and G+ are based on labeled
graphs; they support regular path expressions and graph
construction in queries. GraphLog, whose semantics is based
on Datalog, was applied to Hypertext queries in [CM89].
Paredaens et al. [PdBA+92] developed a graph query language
for object-oriented databases.

Languages for querying semistructured data: Query 
languages for semistructured data such as Lorel [AQM+97],
UnQL [BDHS96] and StruQL [FFLS97] also use labeled
graphs as a flexible data model. In contrast to graph query
languages, they emphasize the ability to query the schema
of the data, and the ability to accommodate irregularities
in the data, such as missing or repeated fields and heterogeneous
records. Related work in the OO community [Har94]
proposes "schema-shy" models and queries to handle infor- 
mation about software engineering artifacts. 

These languages were not developed specifically for the web,
and do not distinguish, for example, between graph edges 
that represent the connection between a document and one 
of its parts and edges that represent a hyperlink from one 
web document to another. Their data models, while elegant, 
are not very rich, lacking such basic comforts as records and 
ordered collections. 

3.3 First generation web query languages 

A first generation of web query languages aimed to combine 
the content-based queries of search engines with structure- 
based queries similar to what one would find in a database 
system. These languages, which include W3QL [KS95], WebSQL
[MMM97, AMM97a], and WebLog [LSS96], combine
conditions on text patterns appearing within documents with 
graph patterns describing link structure. We use WebSQL 
as an example of the kinds of queries that can be asked. 

WebSQL: WebSQL proposes to model the web as a relational
database composed of two (virtual) relations: Docu-
ment and Anchor. The Document relation has one tuple for 
each document in the web and the Anchor relation has one 
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tuple for each anchor in each document in the web. This 
relational abstraction of the web allows us to use a query 
language similar to SQL to pose the queries. 

If Document and Anchor were actual relations, we could
simply use SQL to write queries on them. But since the 
Document and Anchor relations are completely virtual and 
there is no way to enumerate them, we cannot operate on 
them directly. The WebSQL semantics depends instead on 
materializing portions of them by specifying the documents 
of interest in the FROM clause of a query. The basic way 
of materializing a portion of the web is by navigating from 
known URLs. Path regular expressions are used to describe
this navigation. An atom of such a regular expression can
be of the form d1 => d2, meaning document d1 points to
d2 and d2 is stored on a different server from d1; or d1 ->
d2, meaning d1 points to d2 and d2 is stored on the same
server as d1.

For example, suppose we want to find a list of triples of the 
form (d1, d2, label), where d1 is a document stored on our
local site, d2 is a document stored somewhere else, and d1
points to d2 by a link labeled label. Assume all our local
documents are reachable from www.mysite.start.

SELECT d.url, e.url, a.label
FROM Document d SUCH THAT
         "www.mysite.start" ->* d,
     Document e SUCH THAT d => e,
     Anchor a SUCH THAT a.base = d.url
WHERE a.href = e.url

The FROM clause instantiates two Document variables, d 
and e, and one Anchor variable a. The variable d is bound
in turn to each local document reachable from the starting
document, and e is bound to each non-local document reachable
directly from d. The anchor variable a is instantiated
to each link that originates in document d; the extra condi- 
tion that the target of link a be document e is given in the 
WHERE clause. Another way of materializing part of the 
Document and Anchor relations is by content conditions: 
for example, if we were only interested in documents that
contain the string "database" we could have added to the
FROM clause the condition d MENTIONS "database". The
implementation uses search engines to generate candidate
documents that satisfy the MENTIONS conditions.
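
The navigational semantics of the example query can be imitated over a toy, in-memory web. The Python sketch below is not the WebSQL implementation; the `WEB` dictionary, its URLs, and the helper names are assumptions made for illustration. It materializes the local documents reachable from a start URL (the `->*` step) and then reports every link that leaves the server (the `=>` step).

```python
from urllib.parse import urlparse

# Toy web: url -> list of (label, href) anchors. All URLs are made up.
WEB = {
    "http://www.mysite.start/": [
        ("About", "http://www.mysite.start/about"),
        ("Partner", "http://other.example/"),
    ],
    "http://www.mysite.start/about": [
        ("Home", "http://www.mysite.start/"),
        ("Docs", "http://docs.example/intro"),
    ],
    "http://other.example/": [],
    "http://docs.example/intro": [],
}


def is_local(src, dst):
    """A link is local when both ends live on the same server."""
    return urlparse(src).netloc == urlparse(dst).netloc


def local_closure(start):
    """Documents reachable from `start` by zero or more local links."""
    seen = {start}
    stack = [start]
    while stack:
        d = stack.pop()
        for _, href in WEB.get(d, []):
            if is_local(d, href) and href not in seen:
                seen.add(href)
                stack.append(href)
    return seen


def external_links(start):
    """Triples (d.url, e.url, a.label) for links that leave the server."""
    return sorted(
        (d, href, label)
        for d in local_closure(start)
        for label, href in WEB.get(d, [])
        if not is_local(d, href)
    )
```

Only documents reachable by navigation from the known start URL are ever touched, which mirrors the point that the virtual relations cannot be enumerated directly.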

Other Languages: W3QL [KS95] is similar in flavour to
WebSQL, with some notable differences: it uses external 
programs (similar to user defined functions in object-relational 
languages) for specifying content conditions on files rather 
than building conditions into the language syntax, and it 
provides mechanisms for handling forms encountered dur- 
ing navigation. In [KS98], Konopnicki and Shmueli describe
planned extensions to move W3QL into what we call the
second generation. These include modeling internal doc- 
ument structure, hierarchical web modeling that captures 
the notion of web site explicitly, and replacing the exter- 
nal program method of specifying conditions with a general 
extensible method based on the MIME standard. 

WebLog [LSS96] differs from the above languages in using 
deductive rules instead of SQL-like syntax (see the description
of Florid below).

WQL, the query language of the WebDB project [LSCH98],
is similar to WebSQL but it supports more comprehensive
SQL functionality such as aggregation and grouping, and
provides limited support for querying intra-document structure,
placing it closer to the class of languages discussed in
the next subsection.

3.4 Second generation: Web Data Manip- 
ulation Languages 

The languages above treat web pages as atomic objects with
two properties: they contain or do not contain certain text
patterns, and they point to other objects. Experience with
their use suggests there are two main areas of application
that they can be useful for: data wrapping, transformation
and restructuring, as described in Section 4; and web site
construction and restructuring, as described in Section 5. In
both application areas, it is often essential to have access to
the internal structure of web pages from the query language
if we want declarative queries to capture a large part of the
task at hand. For example, the task of extracting a set of
tuples from the HTML pages of the Internet Movie Database
requires parsing the HTML and selectively accessing certain
subtrees in the parse tree.

In this section we describe the second generation of web
query languages that we call "Web data manipulation languages."
These languages go beyond the first generation
languages in two significant ways. First, they provide access
to the structure of the web objects that they manipulate.
Unlike the first-generation languages, they model internal
structure of web documents as well as the external links
that connect them. They support references to model hyperlinks,
and some support ordered collections and records for
more natural data representation. Second, these languages
provide the ability to create new complex structures as a
result of a query. Since the data on the web is commonly
semistructured (or worse), these languages still emphasize
the ability to support semistructured features. We briefly
describe three languages in this class: WebOQL [AM98],
StruQL [FFLS97] and Florid [HLLS97].

WebOQL 

The main data structure provided by WebOQL is the hypertree.
Hypertrees are ordered arc-labeled trees with two
types of arcs, internal and external. Internal arcs are used
to represent structured objects and external arcs are used
to represent references (typically hyperlinks) among objects.
Arcs are labeled with records. Figure 1, from [Aro97], shows
a hypertree containing descriptions of publications from several
research groups. Such a tree could easily be built, for
example, from an HTML file, using a generic HTML wrapper.

Sets of related hypertrees are collected into webs. Both hypertrees
and webs can be manipulated using WebOQL and
created as the result of a query. 

WebOQL is a functional language, but queries are couched 
in the familiar select-from-where form. For example, suppose
that the name csPapers denotes the papers database
in Figure 1, and that we want to extract from it the title 
and URL of the full version of papers authored by "Smith". 

select [y.Title, y'.Url]
from x in csPapers, y in x'
where y.Authors ~ "Smith"

In this query, x iterates over the simple trees of csPapers 
(i.e., over the research groups) and, given a value for x, y
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Figure 1: Example of a hypertree 



iterates over the simple trees of x'. The primed variable x'
denotes the result of applying to tree x the Prime operator,
which returns the first subtree of its argument. The
same operator is used to extract from tree y its first subtree
in y'.Url. The square brackets denote the Hang operator,
which builds an arc labeled with a record formed with the ar- 
guments (in this example, the field names are inferred.) Fi- 
nally, the tilde represents the string pattern matching pred- 
icate: its left argument is a string and its right argument is 
a pattern. 

Web Creation: The query above maps a hypertree into
another hypertree; more generally, a query is a function that 
maps a web into another. For example, the following query 
creates a new page for each research group (using the group 
name as URL). Each page contains the publications of the 
corresponding group. 

select x' as x.Group
from x in csPapers 

In general, the select clause has the form "select $q_1$ as $s_1$,
$q_2$ as $s_2$, ..., $q_m$ as $s_m$", where the $q_i$'s are queries and each
of the $s_i$'s is either a string query or the keyword schema.
The "as" clauses create the URLs $s_1$, $s_2$, ..., $s_m$, which
are assigned to the new pages resulting from each query.

Navigation Patterns: Navigation patterns are regular expressions
over an alphabet of record predicates; they allow
us to specify the structure of the paths that must be followed 
in order to find the instances for variables. 

Navigation patterns are mainly useful for two purposes. The
first is extracting subtrees from trees whose structure
we do not know in detail or whose structure presents
irregularities, and the second is iterating over trees connected
by external arcs. In fact, the distinction between internal
and external arcs in hypertrees becomes really useful
when we use navigation patterns that traverse external arcs. 
Suppose that we have a software product whose documen- 
tation is provided in HTML format and we want to build 
a full-text index for it. These documents form a complex



hypertext, but it is possible to browse them sequentially by 
following links having the string "Next" as label. For build- 
ing the full-text index we need to feed the indexer with the 
text and the URL of each document. We can obtain this 
information using the following query: 

select [x.Url, x.Text]
from x in browse("root.html")
     via (^*[Text ~ "Next"]>)*

StruQL 

STRUQL is the query language of the STRUDEL web site 
management system, described below in Section 5. Even 
though StruQL was developed in the context of a specific 
web application, it is a general purpose query language for 
semistructured data, based on a data model of labeled di- 
rected graphs. In addition, the Strudel data model includes
named collections, and supports several atomic types
that commonly appear in web pages, such as URLs, and 
Postscript, text, image, and HTML files. The result of a 
StruQL query is a graph in the same data model as the 
input graphs. In Strudel, StruQL was used for two tasks: 
querying heterogeneous sources to integrate them into a site 
data graph, and for querying this data graph to produce a 
site graph. 

A StruQL query is a set of possibly nested blocks, each of 
the form: 

[where C1, ..., Ck]
[create N1, ..., Nn]
[link L1, ..., Lp]
[collect G1, ..., Gq].

The where clause can include either membership conditions 
or conditions on pairs of nodes expressed using regular path 
expressions. The where clause produces all bindings of node
and arc variables to values in the input graph, and the re- 
maining clauses use Skolem functions to construct a new 
graph from these bindings. 
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We illustrate STRUQL with a query defining a web site, 
starting with a Bibtex bibliography file, modeled as a labeled 
graph. The web site will consist of three kinds of pages: a 
PaperPresentation page for each bibliography entry, a Year 
page for each year, pointing to all papers published in that 
year, and a Root page pointing to all the Year pages. After 
showing the query in StruQL, we show it in WebOQL to 
give a feel for the differences between the languages. 

// Create Root
create RootPage()

// Create a presentation for every publication x
where  Publications(x), x -> l -> v
create PaperPresentation(x)
link   PaperPresentation(x) -> l -> v
{ // Create a page for every year
  where  l = "year"
  create YearPage(v)
  link   YearPage(v) -> "Year" -> v,
         YearPage(v) -> "Paper" -> PaperPresentation(x),
         // Link root page to each year page
         RootPage() -> "YearPage" -> YearPage(v)
}



In the where clause, the notation Publications (x) means 
that x belongs to the collection Publications, and the atom 
x -> l -> v denotes that there is a link in the graph from
x to v and the label on the arc is l. The same notation is
used in the link clause to specify the newly created edges in 
the resulting graph. After creating the Root page, the first 
CREATE generates a page for each publication (denoted 
by the Skolem function, PaperPresentation). The second 
CREATE, nested within the outer query, generates a Year 
page for each year, and links it to the Root page and to 
the PaperPresentation pages of the publications published 
in that year. Note the Skolem function YearPage ensures 
that a Year page for a particular year is only created once, 
no matter how many papers were published in that year. 
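
The role of Skolem functions in this query can be illustrated with a small Python sketch. It is not the Strudel implementation: the sample bibliography, the `skolem` helper and the string form of node identifiers are assumptions chosen for illustration. The essential point is that a Skolem term is memoized on its function name and arguments, so the same arguments always denote the same node.

```python
# Input bibliography as a labeled graph (hypothetical sample data).
PUBLICATIONS = {
    "pub1": [("title", "Paper A"), ("year", "1997")],
    "pub2": [("title", "Paper B"), ("year", "1997")],
    "pub3": [("title", "Paper C"), ("year", "1998")],
}

nodes = {}     # (function name, args) -> node id; memoizes Skolem terms
edges = set()  # (source node, label, target) arcs of the site graph


def skolem(name, *args):
    """Return the same node for the same function name and arguments."""
    key = (name, args)
    if key not in nodes:
        nodes[key] = f"{name}({', '.join(map(str, args))})"
    return nodes[key]


root = skolem("RootPage")
for x, arcs in PUBLICATIONS.items():
    for label, v in arcs:
        # Outer block: one presentation page per publication.
        page = skolem("PaperPresentation", x)
        edges.add((page, label, v))
        # Nested block: only bindings where the label is "year".
        if label == "year":
            year_page = skolem("YearPage", v)
            edges.add((year_page, "Year", v))
            edges.add((year_page, "Paper", page))
            edges.add((root, "YearPage", year_page))

year_pages = sorted(n for (name, _), n in nodes.items() if name == "YearPage")
```

Although two publications share the year 1997, `year_pages` contains a single page for that year, because `skolem("YearPage", "1997")` returns the node created on its first call.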

Below is the same query in WebOQL. 

select unique [Url: x.year, Label: "YearPage"]
              as "RootPage",
       [label: "Paper" / x] as x.year
from x in browse("bibtex:myfile.bib")
|
select [year: y.url] + y as y.url
from y in "browse(RootPage)"

The WebOQL query consists of two subqueries, with the 
web resulting from the first one "piped" into the second one 
using the "|" operator. The first subquery builds the Root,
Paper, and Year pages, and the second one redefines each
Year page by adding the "year" field to it.

Florid 

Florid [HLLS97, LHL+98] is a prototype implementation of
the deductive and object-oriented formalism F-logic [KLW95].
To use Florid as a web query engine, a web document is 
modeled by the following two classes: 

url::string[get => webdoc].

webdoc::string[url => url; author => string;
               modif => string;
               type => string; hrefs@(string) =>> url;
               error =>> string].

The first declaration introduces a class url, a subclass of string
with the only method get. The notation get => webdoc
means that get is a single-valued method that returns an
object of type webdoc. The method get is system-defined;
the effect of invoking u.get for a url u in the head of a deductive
rule is to retrieve from the web the document with
that URL and cache it in the local Florid database as a
webdoc object with object identifier u.get.

The class webdoc with methods self, author, modif, type,
hrefs and error models the basic information common to
all web documents. The notation hrefs@(string) =>> url
means that the multi-valued method hrefs takes a string
as argument and returns a set of objects of type url. The
idea is that, if d is a webdoc, then d.hrefs@(aLabel) returns
all URLs of documents pointed to by links labeled aLabel
within document d.

Subclasses of documents can be declared as needed using 
F-logic inheritance, e.g.: 

htmldoc::webdoc[title => string; text => string].

Computation in FLORID is expressed by sets of deductive 
rules. For example, the program below extracts from the 
web the set of all documents reachable directly or indirectly 
from the URL www.cs.toronto.edu by links whose labels
contain the string "database."

("www.cs.toronto.edu":url).get.
(Y:url).get <-
    (X:url).get[hrefs@(L)=>>{Y}],
    substr("database", L).

FLORID provides a powerful formalism for manipulating semistructured
data in a web context. However, it does not currently
support the construction of new webs as results of
computation; the result is always a set of F-logic objects in 
the local store. 
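
The two deductive rules above can be read as a least-fixpoint computation: start from one fetched document and keep fetching targets of qualifying links until nothing new is added. The Python sketch below illustrates that reading over a toy link table. It is not how Florid evaluates rules; the `LINKS` data, its URLs other than the start URL, and the function name `reachable` are hypothetical.

```python
# Toy web: url -> list of (link label, target url). Targets are made up.
LINKS = {
    "www.cs.toronto.edu": [
        ("database group", "db.example"),
        ("theory group", "theory.example"),
    ],
    "db.example": [
        ("database courses", "courses.example"),
        ("people", "people.example"),
    ],
    "courses.example": [("database home", "www.cs.toronto.edu")],
    "theory.example": [("database seminar", "seminar.example")],
}


def reachable(start, keyword="database"):
    """Least fixpoint of the two rules: `start` is fetched, and Y is
    fetched if a fetched X links to Y with a label containing `keyword`."""
    fetched = {start}
    changed = True
    while changed:
        changed = False
        for x in list(fetched):
            for label, y in LINKS.get(x, []):
                if keyword in label and y not in fetched:
                    fetched.add(y)
                    changed = True
    return fetched
```

Note that `seminar.example` is never fetched even though its incoming link mentions "database", because the page carrying that link is itself not reachable through qualifying links.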



Ulixes and Penelope 

In the AraNEUS project [AMM97b], the query and restruc- 
turing process is split into two phases. In the first phase, the 
Ulixes language is used to build relational views over the 
web. These views can then be analyzed and integrated using 
standard database techniques. ULIXES queries extract rela- 
tional data from instances of page schemes defined in the 
ADM model, making heavy use of (star-free) path expres- 
sions. The second phase consists of generating hypertextual 
views of the data using the Penelope language. Query op- 
timization for relational views over sets of web pages, such 
as those constructed by Ulixes, is discussed in [MMM98].

Interactive query interfaces 

All the languages in the previous two subsections are too 
complex to be used directly by interactive users, just as SQL 
is; like SQL, they are meant to be used mostly as program- 
ming tools. There has however been work in the design of 
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interactive query interfaces suitable for casual users. For ex- 
ample, Dataguides [GW97] is an interactive query tool for 
semistructured databases based on hierarchical summaries
of the data graph; extensions to support querying single
complex web sites are described in [GW98]. The system described
in [HML+98] supports queries that combine multimedia
features, such as similarity to a given sketch or image,
textual features such as keywords, and domain semantics. 

Theory of web queries 

In defining the semantics of first-generation web query lan- 
guages, it was immediately observed that certain easily stated 
queries, such as "list all web documents that no other doc- 
ument points to," could be rather hard to execute. This 
leads naturally to questions of query computability in this 
context. Abiteboul and Vianu [AV97a] and Mendelzon and 
Milo [MM97] propose formal ways of categorizing web queries 
according to whether they can in principle be computed or 
not; the key idea being that essentially, the only possible way 
to access the web is to navigate links from known starting 
points. (Note this includes as a special case navigating links
from the large collections of starting points known as index 
servers or search engines.) Abiteboul and Vianu [AV97b] 
also discuss fundamental issues posed by query optimiza- 
tion in path traversal queries. Mihaila, Milo and Mendel- 
zon [MMM97] show how to analyze WebSQL queries in 
terms of the maximum number of web sites. Florescu, Levy 
and Suciu [FLS98] describe an algorithm for query contain- 
ment for queries with regular path expressions, which is then 
used for verifying integrity constraints on the structure of 
web sites [FFLS98]. 

4 Information Integration 

As stated earlier, the WWW contains a growing number of 
information sources that can be viewed as containers of sets 
of tuples. These "tuples" can either be embedded in HTML
pages, or be hidden behind form interfaces. By writing spe- 
cialized programs called wrappers, one can give the illusion 
that the web site is serving sets of tuples. We refer to the 
combination of the underlying web site and the wrapper as- 
sociated with it as a web source. 

The task of a web information integration system is to an- 
swer queries that may require extracting and combining data 
from multiple web sources. As an example, consider the do- 
main of movies. The Internet Movie Database contains com- 
prehensive data about movies, their casts, genres and direc- 
tors. Reviews of movies can be found in multiple other web 
sources (e.g., web sites of major newspapers), and several 
web sources provide schedules of movie showings. By com- 
bining the data from these sources we can answer queries 
such as: give me a movie, playing time and a review of 
movies starring Frank Sinatra, playing tonight in Paris.

Several systems have been built with the goal of answering 
queries using a multitude of web sources [GMPQ+97, EW94,
WBJ+95, LRO96, FW97, DG97b, AKS96, Coh98, AAB+98,
BEM+98]. Many of the problems encountered in building
these systems are similar to those addressed in building
heterogeneous database systems [ACPS96, WAC+93, HZ96,
TRV98, FRV96, Bla96, HKWY97]. Web data integration
systems have, in addition, to deal with (1) a large and evolving
number of web sources, (2) little meta-data about the
characteristics of the sources, and (3) a larger degree of source
autonomy.

An important distinction in building data integration sys- 
tems, and therefore in building web data integration sys- 
tems, is whether to take a warehousing or a virtual approach 
(see [HZ96, Hul97] for a comparison). In the warehousing 
approach, data from multiple web sources is loaded into a 
warehouse, and all queries are applied to the warehoused 
data; this requires that the warehouse be updated when 
data changes, but the advantage is that adequate perfor- 
mance can be guaranteed at query time. In the virtual ap- 
proach, the data remains in the web sources, and queries 
to the data integration system are decomposed at run time 
into queries on the sources. In this approach, data is not 
replicated, and is guaranteed to be fresh at query time. On 
the other hand, because the web sources are autonomous, 
more sophisticated query optimization and execution meth- 
ods are needed to guarantee adequate performance. The 
virtual approach is more appropriate for building systems 
where the number of sources is large, the data is changing 
frequently, and there is little control over the web sources. 
For these reasons, most of the recent research has focused 
on the virtual approach, and therefore, so will our discus- 
sion. We emphasize that many of the issues that arise in 
the virtual approach also arise in the warehousing approach 
(often in a slightly different form), and hence our discussion 
is relevant to both cases. Finally, we refer the reader to two 
commercial applications of web data integration, one that 
takes the warehousing approach [Jun98] and the other that 
takes the virtual approach [Jan98]. 

A prototypical architecture of a virtual data integration system
is shown in Figure 2. There are two main features distinguishing
such a system from a traditional database system:

• As stated earlier, the system does not communicate di- 
rectly with a local storage manager. Instead, in order 
to obtain data, the query execution engine communi- 
cates with a set of wrappers. A wrapper is a program 
which is specific to every web site, and whose task is 
to translate the data in the web site to a form that can 
be further processed by the data integration system. 
For example, the wrapper may extract from an HTML 
file a set of tuples. It should be emphasized that the 
wrapper provides only an interface to the data served 
by the web site, and hence, if the web site provides 
only limited access to the data (e.g., through a form 
interface that requires certain inputs), then the wrap- 
per can only support limited access patterns to the 
data. 

• The second difference from traditional systems is that 
the user does not pose queries directly in the schema in 
which the data is stored. The reason for this is that one 
of the principal goals of a data integration system is to 
free the user from having to know about the specific 
data sources and interact with each one. Instead, the 
user poses queries on a mediated schema. A mediated
schema is a set of virtual relations, which are designed 
for a particular data integration application. The re- 
lations in the mediated schema are not actually stored 
anywhere. As a consequence, the data integration sys- 
tem must first reformulate a user query into a query 
that refers directly to the schemas in the sources. In
order to perform the reformulation step, the data integration
system requires a set of source descriptions.
A description of an information source specifies the 
contents of the source (e.g., contains movies), the attributes
that can be found in the source (e.g., genre,
cast), constraints on the contents of the source (e.g.,
contains only American movies), completeness and reliability
of the source, and finally, the query processing
capabilities of the source (e.g., can perform selections,
or can answer arbitrary SQL queries).

Figure 2: Architecture of a data integration system
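
As a rough illustration of what a source description might record, the
following Python sketch stores the five kinds of information listed above
and uses them to discard sources that cannot contribute to a query. The
class, field and source names are hypothetical and are not taken from any
of the cited systems.

```python
from dataclasses import dataclass, field

@dataclass
class SourceDescription:
    """Hypothetical record of what the mediator knows about one web source."""
    name: str
    contents: str                      # mediated relation covered, e.g. "movie"
    attributes: frozenset              # attributes obtainable from the source
    constraints: dict = field(default_factory=dict)   # e.g. {"country": "USA"}
    complete: bool = False             # complete for its contents?
    capabilities: frozenset = frozenset({"fetch"})    # e.g. {"fetch", "select"}

def relevant_sources(descriptions, relation, wanted, filters):
    """Return names of sources that cover `relation`, offer at least one wanted
    attribute, and whose constraints do not contradict the query filters."""
    result = []
    for d in descriptions:
        if d.contents != relation or not (d.attributes & set(wanted)):
            continue
        # A source constrained to country=USA cannot answer country=France.
        contradicts = any(k in filters and filters[k] != v
                          for k, v in d.constraints.items())
        if not contradicts:
            result.append(d.name)
    return result

SOURCES = [
    SourceDescription("us_movies", "movie", frozenset({"title", "genre", "cast"}),
                      constraints={"country": "USA"}),
    SourceDescription("world_movies", "movie", frozenset({"title", "cast"})),
    SourceDescription("reviews", "review", frozenset({"title", "text"})),
]
```

A query for the cast of French movies would, under these assumed
descriptions, be routed only to world_movies, because the constraint on
us_movies rules it out before any network access takes place.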

The following are the main issues addressed in the work on 
building web data integration systems. 

Specification of mediated schema and reformulation: 
The mediated schema in a data integration system is the 
set of collection and attribute names that are used to formulate
queries. To evaluate a query, the data integration
system must translate the query on the mediated schema 
into a query on the data sources, that have their own lo- 
cal schemas. In order to do so, the system requires a set of 
source descriptions. Several recent research works addressed 
the problem of how to specify source descriptions and how 
to use them for query reformulation. Broadly speaking, 
two general approaches have been proposed: Global as view 
(GAV) [GMPQ+97, PAGM96, ACPS96, HKWY97, FRV96,
TRV98] and Local as view (LAV) [LRO96, KW96, DG97a,
DG97b, FW97] (see [Ull97] for a detailed comparison).

In the GAV approach, for each relation R in the medi- 
ated schema, we write a query over the source relations 
specifying how to obtain R's tuples from the sources. The
LAV approach takes the opposite approach. For every in- 
formation source S, we write a query over the relations in 
the mediated schema that describes which tuples are found 
in S. The main advantage of the GAV approach is that 
query reformulation is very simple, because it reduces to
view unfolding. In contrast, in the LAV approach it is sim- 
pler to add or delete sources because the descriptions of 
the sources do not have to take into account the possible 
interactions with other sources, as in the GAV approach, 
and it is also easier to describe constraints on the con- 
tents of sources. The problem of reformulation becomes a 
variant of the problem of answering queries using
views [YL87, TSI96, LMSS95, CKPS95, RSU95, DG97b].
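
To make the remark about view unfolding concrete, here is a minimal
Python sketch of GAV reformulation for conjunctive queries. It assumes a
toy representation in which an atom is a (relation, arguments) pair; the
mediated relations Movie and Playing and the source relations imdb_movie,
listings and cinemas are invented for illustration.

```python
import itertools

# GAV definitions: each mediated relation is defined by a conjunctive query
# over source relations. Head variables are listed first; any other variable
# appearing in the body is existential.
GAV = {
    "Movie": (("title", "director"),
              [("imdb_movie", ("title", "director", "year"))]),
    "Playing": (("title", "city", "time"),
                [("listings", ("cinema", "title", "time")),
                 ("cinemas", ("cinema", "city"))]),
}

def unfold(query_atoms):
    """Rewrite atoms over the mediated schema into atoms over the sources."""
    fresh = itertools.count()
    unfolded = []
    for relation, args in query_atoms:
        head_vars, body = GAV[relation]
        binding = dict(zip(head_vars, args))   # head variable -> query term
        suffix = next(fresh)
        for source_rel, source_args in body:
            # Existential variables get a fresh name per unfolding step.
            renamed = tuple(binding.get(v, f"{v}_{suffix}") for v in source_args)
            unfolded.append((source_rel, renamed))
    return unfolded
```

Each mediated atom is simply replaced by the body of its definition, which
is why reformulation is cheap in GAV. The LAV case needs an algorithm for
answering queries using views and is not covered by this sketch.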

Completeness of data in web sources: In general, sources 
that we find on the WWW are not necessarily complete for
the domain they are covering. For example, a bibliography
source is unlikely to be complete for the field of Computer 
Science. However, in some cases, we can assert complete- 
ness statements about sources. For example, the DB&LP 
Database 2 has the complete set of papers published in most 
major database conferences. Knowledge of completeness of 
a web source can help a data integration system in several 
ways. Most importantly, since a negative answer from a 
complete source is meaningful, the data integration system 
can prune access to other sources. The problem of describ- 
ing completeness of web sources and using this information 
for query processing is addressed in [Mot89, EGW94, Lev96,
Dus97, AD98, FW97]. The work in [FKL97] introduces
a probabilistic formalism for describing the contents
and overlaps among information sources, and presents algo- 
rithms for choosing optimally between sources. 
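
The pruning effect of completeness knowledge can be sketched in a few
lines of Python. The function below is an illustrative assumption about
how a mediator might use a per-source completeness flag; it is not the
algorithm of any cited paper, and it ignores the probabilistic setting
of [FKL97].

```python
def lookup(key, sources):
    """Query sources in order; stop once a complete source has answered.

    `sources` is a list of (name, is_complete, table) triples, where `table`
    maps a key to a record. Returns (answer, names_of_sources_contacted).
    """
    contacted = []
    for name, is_complete, table in sources:
        contacted.append(name)
        answer = table.get(key)
        if answer is not None:
            return answer, contacted
        if is_complete:
            # A miss in a complete source is a definite "no": prune the rest.
            return None, contacted
    return None, contacted
```

Without the completeness flag, a failed lookup would have to visit every
source before the system could report that no answer exists.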

Differing query processing capabilities: From the per- 
spective of the web data integration system, the web sources 
appear to have vastly differing query processing capabilities. 
The main reasons for the different appearance are (1) the 
underlying data may actually be stored in a structured file 
or legacy system and in this case the interface to this data 
is naturally limited, and (2) even if the data is stored in a 
traditional database system, the web site may provide only 
limited access capabilities for reasons of security or perfor- 
mance. 

To build an effective data integration system, these capabili- 
ties need to be explicitly described to the system, adhered to, 
and exploited as much as possible to improve performance. 
We distinguish two types of capabilities: negative capabilities
that limit the access patterns to the data, and positive
capabilities, where a source is able to perform additional 
algebraic operations in addition to simple data fetches. 

The main form of negative capabilities is limitations on the 
binding patterns that can be used in queries sent to the 
source. For example, it is not possible to send a query to 
the Internet Movie Database asking for all the movies in the 
database and their casts. Instead, it is only possible to ask 
for the cast of a given movie, or to ask for the set of movies in
which a particular actor appears. Several works have considered
the problem of answering queries in the presence of
binding pattern limitations [RSU95, KW96, LRO96, FW97].

2 http://www.informatik.uni-trier.de/~ley/db/
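
One simple way to respect binding pattern limitations is to order the
source calls so that every call has its required inputs bound by an
earlier call or by the query itself. The Python sketch below uses a
greedy ordering; the subgoal names movies_of and cast_of are hypothetical
stand-ins for the two access paths in the movie example, and the cited
works use more refined algorithms.

```python
def order_subgoals(subgoals, initially_bound):
    """Greedily order subgoals so every call has its required inputs bound.

    Each subgoal is (name, inputs, outputs): `inputs` must be bound before the
    source can be called, and the call then binds `outputs`.
    Returns the ordered names, or None if no executable order exists.
    """
    bound = set(initially_bound)
    pending = list(subgoals)
    plan = []
    while pending:
        ready = next((g for g in pending if set(g[1]) <= bound), None)
        if ready is None:
            return None          # some inputs can never be bound
        pending.remove(ready)
        plan.append(ready[0])
        bound |= set(ready[2])
    return plan
```

For a query that asks for the co-stars of a given actor, the actor is bound
by the query, so movies_of can run first and its output makes cast_of
executable. With no bound actor, neither access path can be used and the
function returns None.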

Positive capabilities pose another challenge to a data inte- 
gration system. If a data source has the ability to perform 
operations such as selections and joins, we would like to push 
as much as possible of the processing to the source, thereby 
hopefully reducing the amount of local processing and the 
amount of data transmitted over the network. The problem 
of describing the computing capabilities of data sources and 
exploiting them to create query execution plans is considered
in [PGGMU95, TRV98, LRU96, HKWY97, VP97a].
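
The benefit of pushing a selection to a capable source can be shown with
a small simulation. In this Python sketch, Source is a hypothetical
in-memory stand-in for a remote web source and rows_shipped is an assumed
proxy for network traffic; real systems describe capabilities in much
richer languages.

```python
class Source:
    """Hypothetical in-memory stand-in for a remote web source."""
    def __init__(self, rows, can_select):
        self.rows = rows
        self.can_select = can_select
        self.rows_shipped = 0        # counts rows sent "over the network"

    def fetch(self, **equals):
        """Return rows; equality filters are accepted only if supported."""
        if equals and not self.can_select:
            raise ValueError("source cannot evaluate selections")
        out = [r for r in self.rows
               if all(r.get(k) == v for k, v in equals.items())]
        self.rows_shipped += len(out)
        return out

def run_selection(source, **equals):
    """Push the selection to the source when possible, else filter locally."""
    if source.can_select:
        return source.fetch(**equals)
    return [r for r in source.fetch()
            if all(r.get(k) == v for k, v in equals.items())]
```

Both plans return the same rows, but the source that can evaluate the
selection ships only the matching rows, while the other ships its whole
table and leaves the filtering to the integration system.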

Query optimization: Many works on web data integration
systems have focused on the problem of selecting a minimal
set of web sources to access, and on determining the
minimal query that needs to be sent to each one. However,
the issue of choosing an optimal query execution plan to ac- 
cess the web sources has received relatively little attention 
in the data integration literature [HKWY97], and remains 
an active area of research. The additional challenge
faced in query optimization over sources on the WWW
is that we have few statistics on the data in the sources, and
hence little information to evaluate the cost of query exe- 
cution plans. The work in [NGT98] considers the problem 
of calibrating the cost model for query execution plans in 
this context. The work in [YPAGM98] discusses the prob- 
lem of query optimization for fusion queries, which are a 
special class of integration queries that focus on retrieving 
various attributes of a given object from multiple sources. 
Moreover, we believe that query processing in data integra- 
tion systems is one area which would benefit from ideas such 
as interleaving of planning and execution and of computing 
conditional plans [GC94, KD98]. 

Query execution engines: Even less attention has been 
paid to the problem of building query execution engines tar- 
geted for web data integration. The challenges in building 
such engines are caused by the autonomy of the data sources 
and the unpredictability of the performance of the network. 
In particular, when accessing web sources we may experience 
initial delays before data is transmitted, and even when it is,
the arrival of the data may be bursty. The work described
in [AFT98, UFA98] has considered the problem of adapting
query execution plans to initial delays in the arrival of the
data.

Wrapper construction: Recall that the role of a wrapper 
is to extract the data out of a web site into a form that can 
be manipulated by the data integration system. For exam- 
ple, the task of a wrapper could be to pose a query to a 
web source using a form interface, and to extract a set of 
answer tuples out of the resulting HTML page. The diffi- 
culty in building wrappers is that the HTML page is usu- 
ally designed for human viewing, rather than for extracting 
data by programs. Hence, the data is often embedded in 
natural language text or hidden within graphical presenta- 
tion primitives. Moreover, the form of the HTML pages 
changes frequently, making it hard to maintain the wrap- 
pers. Several works have considered the problem of build- 
ing tools for rapid creation of wrappers. One class of tools 
(e.g., [HGMN+98, GRVB98]) is based on developing specialized
grammars for specifying how the data is laid out in
an HTML page, and therefore how to extract the required 
data. A second class of techniques is based on developing 
inductive learning techniques for automatically learning a 
wrapper. Using these algorithms, we provide the system
with a set of HTML pages where the data in the page is labeled.
The algorithm uses the labeled examples to automatically
output a grammar by which the data can be extracted
from subsequent pages. Naturally, the more examples we 
give the system, the more accurate the resulting grammar 
can be, and the challenge is to discover wrapper languages 
that can be learned with a small number of examples. The
first formulation of wrapper construction as inductive learn- 
ing and a set of algorithms for learning simple classes of 
wrappers are given in [KDW97]. The algorithm described
in [AK97] exploits heuristics specific to the common uses of 
HTML in order to obtain faster learning. It should be noted 
that Machine Learning methods have also been used to learn 
the mapping between the source schemas and the mediated 
schemas [PE95, DEW97]. The work described in [CDF+98] is
a first step in bridging the gap between the approaches of 
Machine Learning and of Natural Language Processing to 
the problem of wrapper construction. Finally, we note that 
the emergence of XML may lead web site builders to export 
the data underlying their sites in a machine readable form, 
thereby greatly simplifying the construction of wrappers. 
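
For a concrete picture of what a hand-written wrapper does, the following
Python sketch turns the rows of an HTML table into tuples using the
standard library's html.parser module. It assumes the answer page lays
its data out in td cells, one tr per tuple; that layout, and the class
and function names, are assumptions made for illustration. A wrapper for
a real site would have to follow that site's actual markup.

```python
from html.parser import HTMLParser

class TableWrapper(HTMLParser):
    """Collects the text of each <td> cell, one tuple per <tr> row."""
    def __init__(self):
        super().__init__()
        self.tuples = []
        self._row = None       # cells of the row being read, or None
        self._cell = None      # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:                      # skip rows without <td> cells
                self.tuples.append(tuple(self._row))
            self._row = None

def wrap(html):
    """Extract a list of tuples from the table rows found in `html`."""
    parser = TableWrapper()
    parser.feed(html)
    parser.close()
    return parser.tuples
```

The sketch also shows why wrappers are brittle: presentation markup inside
a cell is tolerated, but any change to the row and cell layout of the page
silently changes or empties the extracted relation.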

Matching objects across sources: One of the hardest 
problems in answering queries over a multitude of sources is 
deciding that two objects mentioned in two different sources 
refer to the same entity in the world. This problem arises 
because each source employs its own naming conventions 
and shorthands. Most systems deal with this problem using 
domain specific heuristics (as in [FHM94]). In the WHIRL 
system [Coh98], matching of objects across sources is done 
by using techniques from Information Retrieval. Further- 
more, the matching of the objects is elegantly integrated in 
a novel query execution algorithm. 
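
As an illustration of the Information Retrieval flavour of matching, the
Python sketch below pairs names from two sources by cosine similarity of
TF-IDF weighted token vectors. This is a generic textbook technique shown
under simplifying assumptions (alphanumeric tokenization, one particular
IDF formula); it does not reproduce the WHIRL query execution algorithm.

```python
import math
import re
from collections import Counter

def tfidf_vectors(names):
    """Return one unit-length TF-IDF vector (a dict) per name."""
    docs = [Counter(re.findall(r"[a-z0-9]+", n.lower())) for n in names]
    df = Counter(tok for d in docs for tok in d)     # document frequency
    n = len(docs)
    vectors = []
    for d in docs:
        v = {t: tf * math.log(1 + n / df[t]) for t, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in v.values()))
        vectors.append({t: w / norm for t, w in v.items()})
    return vectors

def best_matches(left, right):
    """Pair each name in `left` with its most similar name in `right`."""
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    pairs = []
    for name, a in zip(left, lv):
        # Cosine similarity of unit vectors is just their dot product.
        scores = [sum(w * b.get(t, 0.0) for t, w in a.items()) for b in rv]
        best = max(range(len(right)), key=scores.__getitem__)
        pairs.append((name, right[best], round(scores[best], 3)))
    return pairs
```

Because rare tokens receive higher weights than common ones, two names
that share an unusual word are ranked as more similar than two names that
share only a frequent word such as "the".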

5 Web site construction and restructuring 

The previous two sections discussed tasks that concerned 
querying existing web sites and their content. However, 
given the fact that web sites essentially provide access to 
complex structures of information, it is natural to apply 
techniques from Database Systems to the process of building 
and maintaining web sites. One can distinguish two general
classes of web site building tasks: one in which web sites are 
created from a collection of underlying data sources, and 
another in which they are created by restructuring existing 
web sites. As it turns out, the same techniques are required
for both of these classes. Furthermore, we note that the task 
of providing a web interface to data that exists in a single 
database system [NS96] is a simple instance of the problem 
of creating web sites. 3 

To understand the problem of building web sites and the 
possible import of database technology, consider the tasks 
that a web site builder must address: (1) choosing and accessing
the data that will be displayed at the site, (2) designing
the site's structure, i.e., specifying the data contained
within each page and the links between pages, and (3) designing
the graphical presentation of pages. In existing web
site management tools, these tasks are, for the most part,
interdependent. Without any site-creation tools, a site builder
writes HTML files by hand or writes programs to produce 
them and must focus simultaneously on a page's content, its 
relationship to other pages, and its graphical presentation. 
As a result, several important tasks, such as automatically
updating a site, restructuring a site, or enforcing integrity
constraints on a site's structure, are tedious to perform.

3 Most database vendors were quick to provide commercial tools
for performing this task.

Web sites as declaratively defined structures: Several
systems have been developed with the goal of applying
database techniques to the problem of web site creation
[FFK+98, AMM98, AM98, CDSS98, PF98, JB97, LSB+98,
TN98]. The common theme to these systems is that they
provide an explicit declarative representation of the structure
of a web site. The structure of the web site is defined
as a view over existing data. However, we emphasize that 
the languages used to create these views result in graphs of 
web pages with hypertext links, rather than simple tables. 
The systems differ on the data model they use, the query 
language they use, and whether they have an intermediate 
logical representation of the web site, rather than having 
only a representation of the final HTML. 

Building a web site using a declarative representation of the 
structure of the site has several significant advantages. Since 
a web site's structure and content are defined declaratively 
by a query, not procedurally by a program, it is easy to cre- 
ate multiple versions of a site. For example, it is possible 
to easily build internal and external views of an organiza- 
tion's site or to build sites tailored to different classes of 
users. Currently, creating multiple versions requires writ- 
ing multiple sets of programs or manually creating different 
sets of HTML files. Building multiple versions of a site can 
be done by either writing different site definition queries, or 
by changing the graphical representation independently of 
the underlying structure. Furthermore, a declarative rep- 
resentation of the web site's structure also supports easy 
evolution of a web site's structure. For example, to reorga- 
nize pages based on frequent usage patterns [PE97], or to 
extend the site's content, simply rewrite the site-definition 
query, as opposed to rewriting a set of programs or a set of 
HTML files. Declarative specification of web sites can offer 
other advantages. For example, it becomes possible to ex- 
press and enforce integrity constraints on the site [FFLS98], 
and to update a site incrementally when changes occur in 
the underlying data. Moreover, a declarative specification 
provides a platform for developing optimization algorithms 
for run-time management of data intensive web sites. The
challenge in run-time management of a web site is to auto- 
matically find an optimal tradeoff between precomputation 
of parts of the web site and click-time computation. Finally,
we remark that building web sites using this paradigm will 
also facilitate the tasks of querying web sites and integrating 
data from multiple web sources. 

A prototypical architecture of such a system is shown in 
Figure 3. At the bottom level, the system accesses a set of 
data sources containing the data that will be served on the 
web site. The data may be stored in databases, in structured 
files, or in existing web sites. The data is represented in 
the system in some data model, and the system provides 
a uniform interface to these data sources using techniques
similar to the ones described in the previous section. The 
main step in building a web site is to write an expression that 
declaratively represents the structure of the web site. The 
expression is written in a specific query language provided 
by the system. The result of applying this query to the 
underlying data is the logical representation of the web site 
in the data model of the system (e.g., a labeled directed 
graph). Finally, to actually create a browsable web site, 
the system contains a method (e.g., HTML templates) for 
translating the logical structure into a set of HTML files. 
Figure 3: Architecture for Web Site Management Systems
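
The three stages of this architecture (data, logical site graph, HTML)
can be sketched in Python as follows. Plain functions stand in for the
declarative site-definition query and for the HTML templates, so the
sketch only illustrates the separation of structure from presentation;
the data, page identifiers and function names are all hypothetical and
none of the surveyed query languages is used.

```python
from html import escape

MOVIES = [
    {"title": "Vertigo", "year": 1958, "genre": "thriller"},
    {"title": "Psycho", "year": 1960, "genre": "thriller"},
    {"title": "Some Like It Hot", "year": 1959, "genre": "comedy"},
]

def site_graph(movies):
    """Site definition: a root page linking to one page per genre, each of
    which lists its movies. Returns {page_id: {"title", "items", "links"}}."""
    graph = {"index": {"title": "Movies", "items": [], "links": []}}
    for m in movies:
        page_id = "genre_" + m["genre"]
        if page_id not in graph:
            graph[page_id] = {"title": m["genre"].title(), "items": [], "links": []}
            graph["index"]["links"].append(page_id)
        graph[page_id]["items"].append(f'{m["title"]} ({m["year"]})')
    return graph

def render(graph):
    """Presentation step: turn each node of the graph into an HTML page body."""
    pages = {}
    for page_id, node in graph.items():
        items = "".join(f"<li>{escape(i)}</li>" for i in node["items"])
        links = "".join(
            f'<li><a href="{t}.html">{escape(graph[t]["title"])}</a></li>'
            for t in node["links"])
        pages[page_id + ".html"] = (
            f"<h1>{escape(node['title'])}</h1><ul>{items}{links}</ul>")
    return pages
```

A different version of the site (for example, one page per decade instead
of one per genre) would replace only site_graph, and a new visual design
would replace only render, which is the independence that the declarative
approach aims for.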



Some of the salient characteristics of the different systems 
are the following. STRUDEL [FFK+98] uses a semistructured
data model of labeled directed graphs for modeling both the 
underlying data and for modeling the web site. It uses a single
query language, StruQL, throughout the system, both
for integrating the raw data and for defining the structure of
the web site. ARANEUS [AMM97b] uses a more structured 
data model, ADM, and provides a language for transform- 
ing data into ADM and a language for creating web sites 
from data modeled in ADM. In addition, Araneus uses web-specific
constructs in the data model. The Autoweb [PF98]
system is based on the hypermedia design model (HDM), a 
design tool for hypermedia applications. Its data model is 
based on the entity-relationship model; the "access schema" 
specifies how the hyperbase is navigated and accessed in 
a browsable site; and the "presentation schema" specifies 
how objects and paths in the hyperbase and access schemas 
are rendered. All three systems mentioned above provide a 
clear separation between the creation of the logical struc- 
ture of the web site and the specification of the graphical 
presentation of the site. The YAT system [CDSS98] is an 
application of a data conversion language to the problem of 
building web sites. Using YAT, a web site designer writes 
a set of rules converting from the raw data into an abstract 
syntax tree of the resulting HTML, without going through 
an intermediate logical representation phase. In fact, in 
a similar way, other languages for data conversion (such
as [MZ98, MPP+93, PMSL94]) can also be used to build
web sites. The WIRM system [JB97] is similar in spirit to 
the above systems in that it enables users to build web sites 
in which the pages can be viewed as context-sensitive
views of the underlying data. The major focus of WIRM is
on integrating medical research data for the national Human
Brain Project. 

6 Conclusions, Perspectives and Future Work 

An overarching question regarding the topic of this survey 
is whether the World-Wide Web presents novel problems 
to the database community. In many ways, the WWW is 
not similar to a database. For example, there is no uni- 
form structure, no integrity constraints, no transactions, no 
standard query language or data model. And yet, as the
survey has shown, the powerful abstractions developed in
the database community may prove to be key in taming the 
web's complexity and providing valuable services. 

Of particular importance is the view of a large web site as 
being not just a database, but an information system built 
around one or more databases with an accompanying com- 
plex navigation structure. In that view, a web site has many 
similarities to non-web information systems. Designing such 
a web site requires extending information systems design 
methodologies [AMM98, PF98]. Using these principles to 
build web sites will also impact the way we query the web 
and the way we integrate data from multiple web sources. 

Several trends will have significant impact on the use of 
database technology for web applications. The first is, of 
course, XML. The considerable momentum behind XML 
and related metadata initiatives can only help the appli- 
cability of database concepts to the web by providing the 
much needed structure in a widely accepted format. While 
the availability of data in XML format will reduce the need 
to focus on wrappers converting human readable data to machine
readable data, the challenges of semantic integration
of data from web sources still remain. Building on our experience
in developing methods for manipulating semistructured
data, our community is in a unique position to develop
tools for manipulating data in XML format. In fact, some 
of the concepts developed in this community are already 
being adapted to the XML context [DFF+98, GMW98].
Other projects under way in the database community in the 
area of metadata architectures and languages (e.g. [MRT98, 
KMSS98]) are likely to take advantage of and merge with 
the XML framework. 

A second trend that will affect the applicability of database 
techniques for querying the web is the growth of the so-called 
hidden web. The hidden web refers to the web pages that 
are generated by programs given user inputs, and are there- 
fore not accessible to web crawlers for indexing. A recent 
article [LG98] claims that close to 80% of the web is already 
in the hidden web. If our tools are to be able to benefit 
from data in the hidden web, we must develop techniques 
for identifying sites that generate web pages, classify them 
and automatically create query interfaces to them. 

There is no shortage of possible directions for future research
in this area. In the past, the bulk of the work has focused on 
the logical level, developing appropriate data models, query 
languages and methods for describing different aspects of 
web sources. In contrast, problems of query optimization 
and query execution have received relatively little attention 
in the database community, and pose some of the more im- 
portant challenges for future work. Some of the important 
directions in which to enrich our data models and query lan- 
guages include the incorporation of various forms of meta 
data about sources (e.g., probabilistic information) and the 
principled combination of querying structured and unstructured
data sources on the WWW.

Finally, in this article we tried to provide a representative 
list of references on the topic of web and databases. In 
addition to these references, readers can get a more detailed 
account from recent workshops related to the topic of the 
survey [SSD97, Web98, AII98].